Product Assurance Run-time – Product Quality Supervision

General Description

The objective of the Product Assurance Run-time is to perform actions that ensure the quality and suitability of a product, based on the data obtained during the manufacturing process or during the inspection of the finished products. Internally, the Product Assurance Run-time is divided into two tasks:

  • Product Quality Prediction, which predicts the quality of the product and anticipates possible defects arising during the manufacturing process. This has a double advantage: on one hand it avoids unnecessary further manufacturing steps, and on the other it can help to redefine production parameters or initial conditions (materials) as corrective actions to restore the quality threshold

  • Supervision, which monitors each individual part along the entire supply chain, collecting all relevant information in the process, with the aim of identifying critical trends that might result in possible downstream defects

Both tasks are based on Artificial Intelligence (AI) and Big Data techniques. However, since the desired results differ, the approaches, the techniques, and the way they are applied also differ.

The following sections focus on the description of the specific features, requirements, and use of the Supervision task.

The following table shows the assets available:

Resource | Location
Source Code | Link
Online documentation | Product Assurance Run-time – Product Quality Supervision
Open API Spec (Data Processor) | Link
Video | Machine Learning algorithms; Model Deployment Manager; ZDMP Virtual Academy Technical Training Webinar (from 1h 23’)

Screenshots

The following images are illustrative screenshots of the component.

Figure 1. Model Deployment Manager

Component Author(s)

Company Name | ZDMP Acronym | Website
Instituto Tecnológico de Informática | ITI | https://www.iti.es/
Ceteck Tecnológica S.L. | CET | http://www.grupoceteck.com

Commercial Information

Resource | Location
IPR Link | ITI Product Assurance Run-time - Supervision; CET DATA PROCESSOR (CETDATA); CET MODEL DEPLOYMENT MANAGER (CETMDM)
Marketplace Products | Product Assurance Runtime – Product Supervision ML catalogue; Model Deployment Manager; Data Processor

Architecture Diagram

The following diagram shows the position of this component in the ZDMP architecture.


Figure 2. Position of Component in ZDMP Architecture

Benefits

General

The Supervision component has several benefits for Data Analysts, but also for general programmers with little knowledge of Machine Learning. It provides an abstraction layer that allows them to use well-known algorithms and deploy them directly to the AI Analytics Run-time, using the Model Deployment Manager. These are the main benefits of using the Supervision component:

  • Access to predefined algorithms to solve common problems in the industry

  • Ability to customize those algorithms to fit specific problems

  • Abstraction layer for the software engineering tasks, thanks to the integration with the AI Analytics Runtime. This includes:

    • Docker image construction

    • API wrapping

    • Data source connections (Message Bus, Storage, etc)

    • Security monitoring

Model Deployment Manager

  • Defines the manifest, which contains all the needed information to define every aspect of the model, allowing the models to be deployed through the AI Analytics Run-time

  • Friendly user interface, allowing the user to easily generate the manifest

  • The user interface of the Data Processor is integrated within the Model Deployment Manager interface

  • The user selects the inputs and outputs needed to execute the model to be uploaded and, at the same time, configures the Data Processor transformations for the selected inputs or outputs

Data Processor

  • Composed of a set of intrinsically meaningful transformations

  • Supports streaming transformations and batch transformations (configurable)

  • Integrates with the AI Analytics Run-time and benefits from the AI Analytics integrations:

    • Secure Authentication and Authorization

    • Message Bus

    • Infrastructure management

    • Manifest definition

  • Reduced coupling between components allows the Data Processor to focus its efforts on adding more complex data transformations

Features

Supervision Description

Figure 3. Subsystem architecture of Supervision task

The main objective of the Supervision task is to provide real-time predictions regarding the result of a manufacturing process, and whether anomalies are present or not, by using Artificial Intelligence and Big Data techniques. Two scenarios can be differentiated: unsupervised and supervised. In the unsupervised scenario, no product quality variable is available. This is relatively common in SMEs because of the lack of resources to perform quality tests; another cause can be technical difficulties in obtaining product quality data. In the supervised scenario there are one or more variables that measure product quality. In those cases, it is possible to relate process conditions to those quality measures using Machine Learning algorithms. The supervised scenario can be split into two cases: classification (when the quality measure is binary or categorical) and regression (when the quality variable is quantitative).

On the other hand, ZDMP must be able to cope with scenarios involving large volumes of streaming data and product changes. In such scenarios, classical batch machine learning algorithms are not the best approach. Batch learning processes all data simultaneously and does not integrate new data into existing models, instead rebuilding models from scratch. This approach is very time consuming and leads to obsolete models and large volumes of unprocessed data. On the contrary, incremental learning fits this scenario since it continuously integrates new data into existing models, being less time and space consuming than batch learning.

Another issue to consider is class imbalance. One of the objectives of ZDMP is to detect abnormal situations that could lead to faulty products. Good products are more frequent than bad ones which, in machine learning terms, produces an imbalanced dataset, that is, a dataset in which the majority class may be 100 or 1000 times larger than the minority class. This can result in inferior performance of classical machine learning models, because the majority class may dominate the learning process and produce biased results. Some of the implemented algorithms include strategies especially designed to deal with this problem.

Every process in an industrial environment has its own peculiarities. For this reason, this component provides different techniques, such as PCA, incremental online learning, etc, to give versatility to the final users. In addition, the data produced by each manufacturing process varies and, depending on that, the component may use one tool or another, which also provides versatility.

The features to achieve the functionality of this component are itemised below and explained thereafter.

Model Deployment Manager

Figure 4. Subsystem architecture of Model Deployment Manager and Data Processor

  • Provides the user interface to ease the generation of the manifest. The manifest contains all the needed information to execute the model

  • Integrates with AI Analytics Run-time to allow the user to upload models with the expected structure within AI Analytics Run-time, and to run/stop models

  • Integrates with the Data Processor to allow the user to configure data transformations

  • Integrates with Product Assurance Run-time - Product Quality Prediction task and also Supervision task, to show/select the models available (which are explained in Unsupervised Anomaly Detection, Classification and Regression sections)

  • Integrates with Secure Authentication and Authorization to allow the security configuration of the model

Data Processor

  • Defines a rules system to transform incoming data

  • Allows the temporal alignment of data

  • The transformations configured by the user through the Model Deployment Manager user interface are collected in the manifest; the Model Deployment Manager sends the manifest to the AI Analytics Run-time and with this information the AI Analytics Run-time calls the Data Processor API to perform the selected transformations

  • These are the data transformations implemented:

    • Scale by factor: Values are multiplied by a factor introduced by the user. It is used in conversion of units of measure and scaling data to fit into the adequate range

    • Change time zone: Data from different time zones need to be aligned. The solution is to change every timestamp from the origin time zone to a common time zone

    • Change sample frequency: Data collected with different sample frequency need to have the same frequency

    • Lagged variables: Values are delayed in time

    • Mean centering: Each value of the new data series is the deviation (positive or negative) from the average value

    • Missing data imputation: Data series is cleaned of inappropriate data that can cause some models to not work properly (when the value is missing, or it is not a numeric value)

    • Extract data: This transformation is used to extract a value contained in a (JSON) message, when the message contains several signals

System Requirements

Model Deployment Manager

The component may be run on any platform that supports Docker images.

Data Processor

The component may be run on any platform that supports Docker images.

Associated ZDMP services

Required

The following ZDMP components are required for the correct execution of this component:

Optional

These are the optional ZDMP components:

Installation

Model Deployment Manager installation

Model Deployment Manager installation in ZDMP environment

Once selected in the Marketplace (Model Deployment Manager), the component is available for installation through the Secure Installation component. As indicated in the Marketplace information, this component needs the Portal component to work properly. It can be installed on the ZDMP cloud platform or installed on miniZDMP. miniZDMP is a mini version of ZDMP Kubernetes Platform. The guide for the installation of miniZDMP is available at https://software.zdmp.eu//docs/components/enterprise-tier/application-runtime/#minizdmp.

Some installation variables need to be set as described below.

  • Modify the DEPLOYMENTMANAGERUI ENVIRONMENT VARIABLES to match the corresponding domain name:

    • PORTAL_API_URL: It is the Portal API

    • MODELS_URL: It is the AI Analytics Run-time URL

    • MDM_API_URL: It is the Model Deployment Manager API URL

    • DATA_PROCESSOR_URL: It is the Data Processor API URL

  • INGRESS DOMAIN NAME: Specify the corresponding domain name

Model Deployment Manager installation using Docker

This installation using Docker is only recommended when the Model Deployment Manager is going to be used as a stand-alone component.

  • Download the zip file from this link: Download

  • Create a folder named “Model Deployment Manager” and extract the content of the downloaded zip file there. The folder structure should be:


  • Run Docker Desktop

  • Run PowerShell, navigate to the “orchestration” folder, and then run the command “docker-compose up”. Now the Model Deployment Manager is running

  • Download the Secure Authentication and Authorization software from this link: Download. In the same way as in the previous steps, extract the files, run PowerShell, navigate to the “orchestration” folder, and execute the command “docker-compose up”. Now the Secure Authentication and Authorization component is running

  • Download the Portal from this link: Download. In the same way as in the previous step, execute “docker-compose up”. Now the Portal component is running

  • Install Data Processor as described in section “Data Processor installation using Docker”

Data Processor installation

Data Processor installation in ZDMP environment

Once selected in the Marketplace (Data Processor), the component is available for installation through the Secure Installation component. It can be installed on the ZDMP cloud platform or installed on miniZDMP. miniZDMP is a mini version of ZDMP Kubernetes Platform. The guide for the installation of miniZDMP is available at https://software.zdmp.eu//docs/components/enterprise-tier/application-runtime/#minizdmp.

The only variable that needs to be set during the installation is INGRESS DOMAIN NAME, with the corresponding domain name.

Data Processor installation using Docker

This installation using Docker is only recommended when the Data Processor is going to be used as a stand-alone component.

  1. Download the zip file from this link: Download

  2. Create a folder named “Data Processor” and extract the content of the downloaded file there. The folder structure should be:

  3. Run Docker Desktop

  4. Run PowerShell, navigate to the “orchestration” folder, and then run the command: docker-compose up

How to use

Model Deployment Manager

The Model Deployment Manager user interface allows the user to upload a model with the expected structure into the AI Analytics Run-time, selecting the input and output signals needed to run the model. As anticipated in the Features section, this user interface allows configuring the data transformations available in the Data Processor for each of the input and output signals. When the user completes the sequence of steps established in the Model Deployment Manager, the result is a zip file that contains the model to be executed in the AI Analytics Run-time and the manifest, a JSON file containing the data structure and information the AI Analytics Run-time needs to execute the model.

Main screen: Your Deployments

This screen shows the user a list of the available models (those to which the user has permission) within the AI Analytics Run-time. As shown in the image below, the user can see the status of the models (running, stopped, error).

Figure 5. Available models

In the side menu on the left, there are the following items:

  • Deployer: It opens the main screen of the Model Deployment Manager as shown in Figure 5

  • AI Analytics Run-time: It links with the AI Analytics Run-time, opening it in another browser window

  • Catalogue: It opens the Catalogue which contains the algorithms implemented for the Production Supervision subcomponent. As shown in Figure 6, the algorithms are divided into three main categories: Unsupervised, classification and regression. Unsupervised algorithms refer to the case in which a quality variable is not available. The classification algorithms are applied to the case in which the quality variable is binary (type Yes / No) or categorical (with several classes). Regression algorithms address the problem in which the quality variable is quantitative. The user can download any model from the catalogue (clicking on the corresponding “download” button) and use it as a template for the model that will be finally deployed in the AI Analytics Run-time through the Model Deployment Manager. This is described in the following “Step 2: Upload the model” section

Figure 6. Catalogue

In order to upload a new model, the user clicks on “New Deployment” as shown in the following image:

Figure 7. Creation of a new deployment

Then the configuration wizard starts, with the steps described in the following sections.

Step 1: Choose the machine learning type of the model

The user selects the machine learning type of the model, which can be Python, H2O, or Docker layers (Figure 8).


Figure 8. Machine learning model selection

Step 2: Upload the model

In this step the user uploads a zip file which contains the model with all the necessary files.

  • Python models: In the case of Python-based models, the uploaded zip file must contain (at least) a folder named “src” and, inside this folder, the Python files “predict.py” and “train.py” (if the model needs training)

  • H2O models: In the case of H2O-based models, the zip file must contain a folder named “src” and, inside this folder, a “predict.jar” file. These models are created in the AI Analytics Designer, where they are trained (offline), so they do not need a training phase in the AI Analytics Run-time

  • Docker models: In the case of Docker-based models, the zip file contains directly (no “src” folder) the Dockerfile and the other files needed to deploy the Docker image. In this case there is no training phase in the AI Analytics Run-time

The following image (Figure 9) describes the case of a Python model. To explain it better, the different sections of this screen are marked and numbered in green:

Figure 9. Model uploading (Python model)

  • 1: Click on “Upload” and select from the local computer the zip file which contains the model to be uploaded, for example one previously downloaded from the Catalogue described above. As shown on the right, the compressed file must have at least a folder named “src” and, inside this folder, the Python files “train.py” and “predict.py”, for the two machine learning phases to be run in the AI Analytics Run-time, respectively training and inference (a minimal illustrative stub is sketched after this list). For more information about these two Python modules, see the “General Considerations” section below and the ones which follow, where parameters and further details of these functions are described for each machine learning algorithm available in the catalogue. If the user has an already trained model, the corresponding .pkl file must be included in the “src” folder, and the “train.py” file is no longer needed: note that this is exactly the kind of output described in the Export Model section of the Quality Prediction Designer, which is automatically packaged as a result of the training operation

  • 2: In this field the user types the name and version of the Python third-party libraries, if needed. After typing the library name and version, click on “Add” and it is added to list “3”. If the user puts the corresponding “requirements.txt” file (which contains the Python libraries) into the “src” folder, then the user does not need to fill in this section, as the libraries contained in “requirements.txt” will be included automatically in the manifest

  • 3: All the libraries added in “2” are shown in this list. The user can delete a row if needed
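
As an orientation, a minimal Python model package could look like the sketch below. The function names and exact signatures here are illustrative assumptions; the catalogue templates (see the “General Considerations” section and the algorithm sections that follow) define the actual contract of “train.py” and “predict.py”.

```python
# src/predict.py - illustrative stub only (name and signature are assumptions;
# use a catalogue template for the exact contract expected by the AI Analytics Run-time)
import json
import pickle

def predict(json_data: str, s_model: str) -> str:
    """Load the trained model stored at s_model and score the incoming JSON data."""
    with open(s_model, "rb") as f:
        model = pickle.load(f)
    data = json.loads(json_data)
    # ... apply 'model' to 'data' and build the result here ...
    return json.dumps({"result": "Under control", "payload": {}})

# src/train.py - illustrative stub only
def train(json_data: str, s_model: str, predictors=None) -> list:
    """Fit a model on the JSON training data and persist it to s_model."""
    data = json.loads(json_data)
    model = ...  # fit the chosen algorithm on 'data' (restricted to 'predictors' if given)
    with open(s_model, "wb") as f:
        pickle.dump(model, f)
    return []    # empty list: no errors occurred
```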

Step 3: Inputs

In this step the user configures the input data sources that will feed the model. These are the possible configurations for input data channels for training and/or prediction phases (Python models):

  • Training – Historic Data: Defines the dataset to be used in the training phase

  • Prediction – REST API: The trained model will be wrapped inside a REST API, thus generating an exposed endpoint to be called for prediction

  • Training/Prediction – Message Bus: The deployed model will listen to the given topics. In the prediction phase, published data will be consumed by the predict function to make an inference, while in the training phase data will be used for online learning to release an updated version of the trained model

The screen is divided into two parts: Training phase and prediction phase. If the model does not need a training phase, the user clicks on “skip training phase” as shown in the following image, and then the training section disappears:

Figure 10. Input signals specification (Python model)

Training phase

The training phase is only available for Python-based models. In Figure 11, the different steps of the training phase configuration have been marked and numbered.

Figure 11. Training phase configuration (Python model)

Below are the instructions to configure the training phase:

  • 1: To select the origin type of the input data, the user clicks on the drop-down and selects one of the two available options: “Historic data” or “Message Bus”. The following instructions indicate the steps in case the user selects the origin type “Historic data”

  • 2: Once the user has selected “Historic data” in 1, in the field “Training file name” the user types the name of the CSV file containing the dataset that will be used in the training process, which will be the first operation run in the AI Analytics Run-time after completion of the deployment steps described here. By that time, this file must have been uploaded to the Storage of the AI Analytics Run-time as shown in Figure 12: click on “Storage” (in the side menu on the left), and then click on “Upload new dataset”

Figure 12. Uploading a dataset file to AI Analytics Run-time Storage area

  • 3: The “Hyperparameters” are parameters specific to the selected machine learning algorithm, to be used in the training process. If needed, clicking on this button opens a pop-up window allowing the user to type the hyperparameter list in JSON format

  • 4: Clicking on the “Model file” button, a pop-up window appears allowing the user to type the name of the model file (for example trained_model.pkl) which will be generated by the training process and saved in the models area of the AI Analytics Run-time Storage. If the “src” folder already includes a .pkl file corresponding to a trained model, as described in section “Step 2: Upload the model”, then the user does not need to fill in the “Model file” field, as the name of the .pkl file is automatically detected and included in the manifest

  • 5: Finally, the user clicks on “Add input” and this training input is fully configured

  • 6: When this training input is fully configured, it is added to a list on the right side of the screen. The user can delete the configured input if needed

In case the origin type of the training input is “Message Bus” instead of “Historic data”, the configuration must follow the same steps indicated in the following section for the prediction phase. Note that this might be the case when the selected algorithm implements a continuous (incremental or adaptive) learning approach, as is the case for some of the machine learning algorithms described in the final sections.

Prediction phase with origin type Message Bus

To explain it better, in the following image the training phase has been skipped and the various parts of the screen have been marked and numbered in green:

Figure 13. Prediction phase configuration of inputs with origin type Message Bus (Python model)

These are the instructions to configure an input signal (Python model):

  • 1: To select the origin type of the input, click on the tab. The user can select between two origin types: “Message Bus” or “Rest API”. As the title of this subsection indicates, the case of origin type “Message Bus” is explained here

  • 2: In the field “Message Bus topic name” the user types the corresponding topic

  • 3: In the section “Transformations to apply” the user clicks on the tab and a drop-down menu opens with the different transformations available in the Data Processor

  • 4: Clicking on the “Add” button, the transformation selected in “3” appears in a table. In case of selecting, for example, the transformation “scale by factor”, then the screen is as shown in Figure 14

Figure 14. Example of transformation (“scale by factor”) (Python model)

  • 5: Clicking on the “settings” symbol the user configures this transformation. The configuration of each of the transformations available in the Data Processor is explained in the section “Configuring Data Processor transformations”

  • 6: When the user has finished loading and configuring all the transformations to be applied to this input signal, then the user can click on “Add input” and this input signal is fully configured

  • 7: When the input signal is fully configured, it is added to a list on the right side of the screen as shown in Figure 15. The user can delete this configured input if needed

Figure 15. Input signal configuration (Python model)

At this point, the user can select another input signal and repeat the process until all the input signals needed are configured.

In the case of an H2O-based model, to configure inputs, the user can choose the origin type “Message Bus” or “Rest API”. If the origin type is “Message Bus”, the user configures the inputs as described in this same section (“Prediction phase with origin type Message Bus”). If the origin type is “Rest API”, the user configures the inputs as described in the “Prediction phase with origin type Rest API” section. In addition, the user needs to type the name of the .jar file used to make predictions, as shown in Figure 16:


Figure 16. Prediction phase configuration of inputs (H2O model)

In the case of a model based on Docker layers, the user defines the inputs and outputs in the Docker image, so they are not configured from the Model Deployment Manager. In this step, the fields to fill in are the following, as shown in Figure 17:

  • Origin type: The user selects “Message Bus” or “Rest API”

  • Environment variables: The user types the environment variables in JSON format. For example: {"var1":"value1","var2":"value2"}

  • Layers: The user types the layers to be included in JSON format. For example: [{"name":"layername1"}, {"name":"layername2"}]

  • Open API: This field is optional. If the user has a Rest API which consists of a single endpoint (POST), then this field does not need to be filled in. However, if the user has a more complex Rest API, then this field needs to be filled in with the OpenAPI specification in JSON format.


Figure 17: Prediction phase configuration of inputs (model based on Docker Layers)

Prediction phase with origin type Rest API

If the origin type of the input is “Rest API” the Prediction phase section is as shown in Figure 18:

Figure 18. Prediction phase with origin type Rest API (before configuration)

In Figure 19 the various parts of the screen have been marked and numbered in green to explain the configuration steps:

Figure 19. Prediction phase with origin type Rest API (after configuration)

  • 1: The user selects origin type “Rest API”

  • 2: The user types the name and “Description” of a parameter, then clicks on “Add” and this parameter is added to list “3”

  • 3: All parameters added in “2” are shown in this list. The user can change the order in the list by clicking on the up/down arrows, and can also delete a row if needed

  • 4: When all the parameters have been added to the list, the user clicks on “Add input” and the Rest API input is fully configured

  • 5: When the Rest API is fully configured, it is added on the right side of the screen. The user can delete it if needed

Configuring Data Processor transformations

This section describes how to configure the different transformations available in the Data Processor.

When configuring inputs and outputs, if needed, the user can configure data transformations. As shown in Figure 20 the user selects the data transformation in the drop-down, clicks on “Add” to add it to the list, and then clicks on the “setting” symbol to configure this transformation.

Figure 20. Data transformations

Each of the available transformations are described below:

  • Scale by factor: The values of the signal are multiplied by a factor introduced by the user. It is used for conversions of units of measure or for scaling data to fit into the appropriate format/range. Clicking on the “setting” symbol, a pop-up window appears as shown in the following image. The user only has to type the factor (a real number, with the decimal part separated by a comma)

Figure 21. Example of scale by factor transformation

  • Change time zone: It is used when data collected from different time zones need to be aligned. The user selects the origin time zone and the target time zone. The timestamps of the signal (which was collected in the origin time zone) will be changed from origin time zone to target time zone. The following image shows how the user chooses the time zones by selecting from a drop-down list that includes all the time zones in the world

Figure 22. Example of time zone change transformation

  • Change sample frequency: This transformation is used when data collected with different sample frequencies need to have the same frequency. The user only has to introduce the time period between samples. As shown in the image below, there is a field to enter a numeric value and another to select the time unit (second, minute, hour, etc)

Figure 23. Example of sample frequency transformation

  • Extract Data: This transformation is used to extract a value contained in a (JSON) message, when a message contains several signals (for example a message that contains temperature, pressure, and humidity). The user types the path to access the value and the path to access the timestamp. For example, if the JSON message is { "Temperature": { "value": 63, "timestamp": "20/12/2019 14:08:02" }, "Pressure": { "value": 63, "timestamp": "20/12/2019 14:08:02" } } and the data to extract is 63 and its timestamp, then the user types “Temperature.value” and “Temperature.timestamp”, as shown in Figure 24 (a small sketch of this transformation follows this list)


Figure 24: Example of extract data transformation

  • Lagged variables: This transformation consists of delaying the values in time. Given a data series with values and their corresponding timestamps, the user types a number in “position”. The timestamps are then reassigned, displacing them n positions with respect to the values, so that the values are delayed in time

  • Mean centering: This transformation consists of calculating the mean of the data series and subtracting that mean from each value, so that, in the new data series, each value is the deviation (positive or negative) from the average value

  • Missing data imputation: This transformation consists of deleting entries whose value is missing or not numeric. In this way, the data series is cleaned of inappropriate data that can cause some models not to work properly
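
To make the behaviour of these transformations more concrete, the following sketch shows how two of them (“scale by factor” and “extract data”) act on a single value or message. This is only an illustration of the semantics described above; the actual transformations are executed by the Data Processor according to the manifest.

```python
import json

def scale_by_factor(value: float, factor: float) -> float:
    # "Scale by factor": multiply the raw value by a user-supplied factor
    return value * factor

def extract_data(message: str, value_path: str, timestamp_path: str):
    # "Extract data": pull one signal (value and timestamp) out of a JSON message
    # that contains several signals, using dot-separated paths such as "Temperature.value"
    payload = json.loads(message)
    def walk(path: str):
        node = payload
        for key in path.split("."):
            node = node[key]
        return node
    return walk(value_path), walk(timestamp_path)

# Example with the message from Figure 24
msg = '{"Temperature": {"value": 63, "timestamp": "20/12/2019 14:08:02"}}'
print(scale_by_factor(63, 2.0))                                          # 126.0
print(extract_data(msg, "Temperature.value", "Temperature.timestamp"))   # (63, '20/12/2019 14:08:02')
```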

Step 4: Outputs

In this step the user configures the output signals where the model will write the results. For both prediction and training phases, the only possible destination type is “Message Bus”. To configure the outputs, follow the same steps indicated in section “Prediction phase with origin type Message Bus” and in case transformations are needed on the outputs, proceed as explained in section “Configuring Data Processor transformations”. The results will be published to all the given topics: if it is a prediction message bus topic, new predictions made for each input received will be published, while if it is a training message bus topic, once the training is finished, the model file will be saved, and the training statistics will be published.

Step 5: Security

The user selects which roles, previously defined in the platform by the user responsible for managing the company profiles, will have access to the model once deployed: only users holding the selected roles will be allowed to access it.

Figure 25. Specifying model security

Step 6: Metadata

The user types the following fields, which are necessary to construct the manifest:

  • Name: Name of the model

  • Version: Version of the model (format must be a number)

  • Tags: Tags separated by commas which will be associated to the model. These tags will be used to apply filters

Figure 26. Specifying model metadata

Step 7: Confirmation

This is the last step in the wizard. The user can review the manifest on the screen to check that the data collected through the different steps is correct. By clicking on “Send to AI Analytics” (Figure 28), this manifest JSON file, together with the model, is sent to the AI Analytics Run-time, which after a few moments will automatically deploy the model for run time. If the user wants to download the manifest JSON to their local computer, they must click on “Download” before clicking on “Send to AI Analytics”.

The figures below show the top of the manifest (Figure 27) and the bottom (Figure 28, when the user moves the scroll through the window).

Figure 27. Final manifest example (top)


Figure 28: Final manifest example (bottom)

General Considerations

For ease of use, and to comply with the structure defined for the Model Deployment Manager, all the algorithms have been packaged into two main files, one for training (“train.py”) and another for prediction (“predict.py”).

For all the algorithms, there is an example of use of both “train.py” and “predict.py” functions in the “example.py” file in the source code folder.

Unsupervised Anomaly Detection

Principal Component Analysis (PCA)

Description

Principal Component Analysis (PCA) is a technique for Unsupervised Anomaly Detection which involves the eigenvalue decomposition of the covariance matrix, usually after scaling the variables to zero mean and unit variance. This transformation constructs a new coordinate system for the original dataset in which the largest variance of the data is captured on the first axis, the second largest variance on the second axis, etc. Each of these axes is called a principal component or latent variable.

PCA is often used as a dimensionality reduction technique, allowing the original data to be compressed into a few latent variables (principal components) and then reconstructing the data from those latent variables.

PCA provides two statistics to determine the anomaly score of the data: the Squared Prediction Error (SPE) and Hotelling’s T2. Each of these statistics has its own control limit or UCL (Upper Control Limit) that allows determining whether an observation is anomalous or not.

Apart from the ability to detect anomalies, PCA provides contribution plots (either for the SPE or for Hotelling’s T2), which allow identifying the causes of such anomalies. This ability to diagnose the anomaly is characteristic of PCA and is not found in most other anomaly detection methods. It is a particularly important characteristic when it comes to better understanding the data under study, that is, better understanding the mechanisms that have caused the anomalies.
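
For reference, the two statistics follow the usual multivariate statistical process control definitions. The snippet below is an illustrative numpy sketch of those standard formulas (it is not the component’s implementation): given a scaled observation x, the loading matrix P of the retained components and the per-component score variances (eigenvalues), SPE is the squared reconstruction error and T2 is the sum of the squared scores weighted by their variances.

```python
import numpy as np

def pca_statistics(x: np.ndarray, P: np.ndarray, lambdas: np.ndarray):
    """Standard SPE and Hotelling's T2 for one scaled observation (illustrative only)."""
    t = x @ P                                  # scores: projection onto the retained components
    x_hat = t @ P.T                            # reconstruction from the latent variables
    spe = float(np.sum((x - x_hat) ** 2))      # Squared Prediction Error
    t2 = float(np.sum(t ** 2 / lambdas))       # Hotelling's T2
    return spe, t2
```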

The developed library provides the following features:

  • Implemented algorithms: Singular Value Decomposition (SVD) and NIPALS (Nonlinear Iterative Partial Least Squares):

    • With SVD all the components must be calculated even though the higher ones are not needed

    • NIPALS computes the components iteratively

    • SVD is faster for small datasets and NIPALS is more suitable for large datasets. The implemented algorithms cannot manage missing values

  • Computation of the number of components; one of the following options:

    • The user can specify a given number of components

    • The class will compute the optimal number of components automatically by cross-validation

    • The class will compute the number of components that explain an R2 value specified by the user

  • Data cleansing: Provides a function to obtain the Normal Operating Conditions for a given dataset by removing outliers. The possibilities implemented are:

    • To remove the most extreme observations (those with values greater than 3 times the upper control limits)

    • To perform a deeper cleansing, by removing the observations with the highest values of the statistics (SPE and Hotelling’s T2) until having 5% of out-of-control observations (which is the statistically expected percentage)

  • Fault detection: Compute the SPE and Hotelling’s T² statistics and their corresponding Upper Control Limits to detect abnormal situations

  • Fault diagnosis: Compute the contributions to SPE and Hotelling’s T² to identify which features are responsible for such abnormal situations

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • n_components (integer, optional, default=-1):

    • If -1 the components will be computed automatically.

    • If >0 the model will be fit with n_components

  • n_comp_criteria (float or string, optional, default=0.9): Criteria for the selection of the number of components:

    • If it is a number between [0,1] it specifies the minimum value of R2 that the model must achieve. That is, the model must have the minimum number of components that give the specified value of R2

    • If it is “Q2” then the number of components will be computed automatically by cross-validation based on Q2 trend

  • scale (boolean, optional default=True): Whether to scale the data or not. By default, data is standardized to unit variance and zero‑mean

  • cleansing (string, optional, default='none’): Must be one of the following:

    • ‘none’: No data cleansing is done at all

    • ‘extreme’: Remove observations with SPE (or T2) greater than 3*UCL

    • ‘full’: Remove observations with SPE (or T2) greater than 3*UCL and remove highest values in SPE (or T2) until having a 5% of out-of-control observations

  • algorithm (string, optional, default='SVD’): Algorithm to compute PCA. Must be one of the following:

    • ‘SVD’: Singular Value Decomposition

    • ‘NIPALS’: Nonlinear Iterative Partial Least Squares

    • Note: with SVD all the components must be calculated even though the higher ones are not needed, while NIPALS computes the components iteratively. SVD is usually faster for small datasets and NIPALS is more suitable for large datasets

  • tol_variance_filter (float, optional, default=1e-05): Removes features with coefficient of variation less than tol_variance_filter. This avoids problems with constant features or features with low variance which are not useful for the model.

When fitting the PCA model the SPE (Squared prediction error) and the Hotelling’s T² statistics are computed. The corresponding upper control limits (UCL) of both statistics are also computed. All this information is stored as properties in the PCA model object.

The training function returns two values:

  • List that contains error info if any error occurs, null otherwise

  • R2 of the predictions
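
As an illustration, when this model is deployed through the Model Deployment Manager, the “Hyperparameters” field described in “Step 3: Inputs” could contain a JSON object built from the optional parameters above. The values below are examples only:

```python
import json

# Example hyperparameter set for the PCA training function (illustrative values;
# any subset of the optional parameters listed above may be provided)
hyperparameters = {
    "n_components": -1,       # compute the number of components automatically
    "n_comp_criteria": "Q2",  # select it by cross-validation based on the Q2 trend
    "scale": True,            # standardise to zero mean and unit variance
    "cleansing": "extreme",   # remove observations with SPE or T2 greater than 3*UCL
    "algorithm": "NIPALS",    # iterative computation, more suitable for large datasets
}
print(json.dumps(hyperparameters))
```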

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • result: predicted value with the possible values ‘Under control’, ‘Anomaly (SPE)’, ‘Anomaly (T2)’ or ‘Anomaly (SPE+T2)’

  • payload: contains the following fields:

    • UCL_SPE: Upper Control Limit for Squared Prediction Error

    • SPE: Value of Squared Prediction Error statistic

    • SPE_contrib: Dictionary with SPE contributions (one per variable)

    • UCL_T2: Upper Control Limit for T2 Hotelling’s statistic

    • T2: Value of T2 Hotelling’s statistic

    • T2_contrib: Dictionary with T2 contributions (one per variable)

PCA is able not only to detect abnormal situations but also to identify the features responsible for such abnormalities. When the observation is an anomaly, it may be because the SPE statistic is above UCL_SPE and/or because T2 is above UCL_T2. The corresponding contributions (SPE_contrib and/or T2_contrib) should then be checked. As stated before, contributions are given as a dictionary where the key is the name of the variable and the value is the contribution of that variable. Variables with higher contributions should be investigated to find the root causes of the anomaly.
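
Assuming the prediction JSON has the structure described above, a caller could use the contributions to rank the variables to investigate, as in the following sketch (illustrative code, not part of the library; the entry point of “predict.py” is defined by the catalogue template):

```python
import json

def diagnose(prediction_json: str, top_n: int = 3):
    """Return the variables contributing most to an anomaly, or an empty list if under control."""
    prediction = json.loads(prediction_json)
    if prediction["result"] == "Under control":
        return []
    payload = prediction["payload"]
    # Check which statistic exceeded its limit; if both did, SPE contributions are used here
    contrib = payload["SPE_contrib"] if payload["SPE"] > payload["UCL_SPE"] else payload["T2_contrib"]
    # Variables with the highest contributions are the first candidates to investigate
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```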

PCA (online)

This is an extension of the Principal Component Analysis algorithm to work in online mode. To achieve this, a sliding window technique is used, and two parameters are incorporated: the frequency (number of instances) at which to update the model and the size of the window to use for such an update. Thus, the parameters of the training function of the online PCA are the same as in Principal Component Analysis (PCA) plus the following additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

Autoencoder

Description

Autoencoder is a technique for Unsupervised Anomaly Detection. Autoencoders are a dimensionality reduction technique, so they can be considered an alternative to PCA (Principal Component Analysis) when nonlinearities are present. The hidden layer has a much smaller number of nodes than the input layer, so the outputs of this hidden layer can be interpreted as a reduced representation of the data. Certain simplified Autoencoder architectures have been shown to result in a reduction in dimensionality similar to that achieved with PCA. The architecture on both sides of the hidden layer is usually symmetrical. In fact, the Autoencoder can be divided into two parts, on either side of the middle layer, called encoder and decoder. That is to say, the Autoencoder is capable of compressing the data (encoder) and then reconstructing it (decoder). As with PCA, the Autoencoder captures the information of the majority of observations, which are normal observations, so when reconstructing the data, anomalous observations will have a much greater error than normal ones. This error can be used as a measure of the degree of anomaly of the observations. The implemented library is a wrapper of the PyOD library’s Autoencoder.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals: (list, optional, default=None): The list of categorical variables

  • hidden_neurons (list, optional, default= [64, 32, 32, 64]): The number of neurons per hidden layers

  • hidden_activation (string, optional, default='relu’): Activation function to use for hidden layers. All hidden layers are forced to use the same type of activation. See https://keras.io/activations/

  • output_activation: (string, optional, default='sigmoid’): Activation function to use for output layer. See https://keras.io/activations/

  • loss (string or obj, optional, default=keras.losses.mean_squared_error): String (name of objective function) or objective function. See https://keras.io/losses/

  • optimizer (string, optional, default='adam’): String (name of optimizer) or optimizer instance. See https://keras.io/optimizers/

  • epochs: (integer, optional, default=100): Number of epochs to train the model

  • batch_size (integer, optional, default=32): Number of samples per gradient update

  • dropout_rate (float in (0, 1), optional, default=0.2): The dropout to be used across all layers

  • l2_regularizer (float in (0., 1), optional, default=0.1): The regularization strength of activity_regularizer applied on each layer. By default, l2 regularizer is used. See https://keras.io/regularizers/

  • validation_size (float in (0., 1), optional, default=0.1): The percentage of data to be used for validation

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)
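
Since the other detectors described below (Isolation Forest, IForestASD, KitNET, KNN, LOF) return the same “label”/“score”/“limit” structure, a caller can interpret their output uniformly, for example:

```python
import json

def is_anomaly(prediction_json: str) -> bool:
    """Illustrative check of the common prediction output of the PyOD/PySAD wrappers
    (assumes the prediction covers a single observation)."""
    p = json.loads(prediction_json)
    # label: 0 = inlier, 1 = outlier; a score above "limit" also indicates an anomaly
    return p["label"] == 1 or p["score"] > p["limit"]
```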

Autoencoder (online)

This is an extension of the Autoencoder algorithm to work in online mode. To achieve this, a sliding window technique is used, and two parameters are incorporated: the frequency (number of instances) at which to update the model and the size of the window to use for such an update. Thus, the parameters of the training function of the online Autoencoder are the same as in Autoencoder plus the following additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

Isolation Forest

Description

Isolation Forest is a technique for Unsupervised Anomaly Detection. An Isolation Forest is an ensemble (combination) of a series of isolation trees. In an isolation tree, data is recursively partitioned by cuts parallel to the axes, at cut-off points selected randomly and in dimensions (variables) also selected randomly. The process repeats until each node contains a single instance. The branches that contain anomalies are shallower, since anomalous data are found in more dispersed regions than normal data and are therefore more easily separable from the rest of the observations. Hence, the distance between the end node and the root node is used as a measure of the degree of anomaly. The definitive degree of anomaly is obtained by averaging the lengths of the branches of the observations over the different trees of the Isolation Forest. The implemented library is a wrapper of the PyOD library’s Isolation Forest.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals: (list, optional, default=None): The list of categorical variables

  • n_estimators (integer, optional, default=100): The number of base estimators in the ensemble

  • max_samples (integer or float, optional, default="auto”): The number of samples to draw from X to train each base estimator:

    • If integer, then draw max_samples samples

    • If float, then draw max_samples * X.shape[0] samples

    • If “auto”, then max_samples=min(256, n_samples)

    • If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling)

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function

  • max_features (integer or float, optional, default=1.0): The number of features to draw from X to train each base estimator:

    • If integer, then draw max_features features

    • If float, then draw max_features * X.shape[1] features

  • bootstrap (boolean, optional, default=False): If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

Isolation Forest ASD

Description

Isolation Forest is an efficient method for anomaly detection with relatively low complexity and low CPU and time consumption. However, it requires all the data at the beginning to build the t random samples, and it needs many passes over the dataset to build the whole random forest, so it is not adapted to the data stream context. To overcome these drawbacks, the so-called Isolation Forest Algorithm for Stream Data (IForestASD) method has been proposed, which is an improvement of IForest for anomaly detection in data streams. IForestASD uses a sliding window to deal with streaming data. On the current complete window, IForestASD executes the standard IForest method to obtain the random forest; this is the IForest detector model for IForestASD. The IForestASD method can also deal with concept drift in the data stream by maintaining a desired anomaly rate (u) given as input. If the anomaly rate in the considered window is higher than u, then a concept drift has occurred, so IForestASD deletes its current IForest detector and builds another one using all the data in the considered window (see the sketch below). The implemented library is a wrapper of the PySAD library’s IForestASD.
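
The sliding-window and concept-drift logic described above can be summarised with the following conceptual sketch (it mirrors the description only and is not the PySAD implementation; train_iforest and detector.is_anomaly are placeholder callables):

```python
def iforest_asd_window_step(window, detector, desired_anomaly_rate, train_iforest):
    """Process one complete sliding window and decide whether the IForest detector is rebuilt."""
    if detector is None:
        return train_iforest(window)               # first window: build the initial IForest detector
    anomaly_rate = sum(detector.is_anomaly(x) for x in window) / len(window)
    if anomaly_rate > desired_anomaly_rate:        # anomaly rate above u => concept drift detected
        return train_iforest(window)               # drop the old detector and rebuild it on this window
    return detector                                # otherwise keep the current detector
```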

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals: (list, optional, default=None): The list of categorical variables

  • n_initial_fit (integer, optional, default = 5): Number of instances for an initial training. The model will be trained for the first time after this number of instances. Predictions will not be available during this initial period

  • window_size (integer, optional, default=2048): The size of the reference window and its sliding

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

KitNet

Description

KitNET is a lightweight online anomaly detection algorithm based on an ensemble of autoencoders. Each autoencoder in the ensemble is responsible for detecting anomalies relating to a specific aspect of the process behaviour. KitNET is designed to run on simple machines and in real time, so it has a small memory footprint and low computational complexity. The implemented library is a wrapper of the PySAD library’s KitNet.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals: (list, optional, default=None): The list of categorical variables

  • n_initial_fit (integer, optional, default=100): Number of instances for an initial training. The model will be trained for the first time after this number of instances. Predictions will not be available during this initial period

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

  • max_size_ae (integer, optional, default=10): The maximum size of any autoencoder in the ensemble layer

  • learning_rate (float in (0, 1), optional, default=0.1): The default stochastic gradient descent learning rate for all autoencoders in the KitNET instance

  • hidden_ratio (float in (0, 1), optional, default=0.75): The default ratio of hidden to visible neurons. E.g., 0.75 will cause roughly a 25% compression in the hidden layer

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

K-Nearest Neighbour (KNN)

Description

KNN is a technique for Unsupervised Anomaly Detection. In this case, the degree of anomaly of an observation is based on its distance to the k‑th nearest neighbour. The implemented library is a wrapper of the PyOD library’s KNN.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals: (list, optional, default=None): The list of categorical variables

  • n_neighbors (integer, optional, default = 5): Number of neighbours to use by default for k neighbours queries

  • method (string, optional, default='largest’). Must be one of the following:

    • ‘largest’: Use the distance to the kth neighbour as the outlier score

    • ‘mean’: Use the average of all k neighbours as the outlier score

    • ‘median’: Use the median of the distance to k neighbours as the outlier score

  • radius (float, optional, default = 1.0): Range of parameter space to use by default for `radius_neighbors` queries

  • algorithm (string, optional, default=’auto’). Algorithm used to compute the nearest neighbours. Must be one of the following:

    • ‘ball_tree’ will use BallTree

    • ‘kd_tree’ will use KDTree

    • ‘brute’ will use a brute-force search

    • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method

  • leaf_size (integer, optional, default = 30): Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem

  • metric (string or callable, default ‘minkowski’): Metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics but is less efficient than passing the metric name as a string. Distance matrices are not supported. Valid values for metric are (see the documentation for scipy.spatial.distance for details on these metrics):

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

  • p (integer, optional, default = 2): Parameter for the Minkowski metric from scikit-learn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/scikit-learn.metrics.pairwise.pairwise_distances

  • metric_params (dict, optional, default = None): Additional keyword arguments for the metric function

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.
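
As an illustration, a call to this training function could look as follows. This is a minimal sketch: the module and function names (train.py, train) and the input file are hypothetical, and only a subset of the parameters listed above is shown.

```python
# Hypothetical usage sketch of the KNN training function described above.
# Module/function names and the input file are assumptions for illustration.
import pandas as pd

from train import train  # hypothetical training module of the ML catalogue

# Build the json_data payload from a pandas DataFrame (records orientation assumed)
df = pd.read_csv("sensor_readings.csv")
json_data = df.to_json(orient="records")

error = train(
    json_data=json_data,
    s_model="/models/knn_anomaly.pkl",              # where the fitted model is persisted
    predictors=["temperature", "pressure", "vibration"],
    categoricals=["machine_id"],
    n_neighbors=5,
    method="largest",                               # distance to the k-th neighbour as score
    metric="euclidean",
    contamination=0.05,                             # expected fraction of outliers
)

if error is not None:
    print("Training failed:", error)
```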

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)
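
The mapping between these fields and the wrapped PyOD detector can be sketched as follows. The PyOD calls (fit, predict, decision_function, threshold_) belong to PyOD’s public API; how the wrapper assembles the JSON is an assumption for illustration.

```python
# Sketch of how "label", "score" and "limit" map onto PyOD's KNN detector.
# The JSON assembly is an assumption about the wrapper, not its actual code.
import json

import numpy as np
from pyod.models.knn import KNN

X_train = np.random.rand(500, 3)     # training data (illustrative)
X_new = np.random.rand(10, 3)        # data passed to predict.py as json_data

clf = KNN(n_neighbors=5, method="largest", contamination=0.1)
clf.fit(X_train)

result = {
    "label": clf.predict(X_new).tolist(),            # 0 = inlier, 1 = outlier
    "score": clf.decision_function(X_new).tolist(),  # anomaly score per observation
    "limit": float(clf.threshold_),                  # scores above this limit are outliers
}
print(json.dumps(result))
```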

KNN (online)

This is an extension of the K-Nearest Neighbour algorithm to work in online mode. To achieve this, a sliding-window technique is used, adding two parameters: the frequency (number of instances) at which the model is updated and the size of the window used for that update. So, the parameters of the training function of KNN online are the same as in K-Nearest Neighbour (KNN) plus two additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances
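
A minimal sketch of this sliding-window behaviour is shown below, assuming a simple buffer-based implementation (this is not the component’s actual code):

```python
# Sliding-window retraining sketch for the online variants (illustrative only).
from collections import deque

import numpy as np
from pyod.models.knn import KNN

n_update_freq = 100      # retrain every 100 instances
n_update_size = 500      # retrain on the last 500 instances

window = deque(maxlen=n_update_size)
model = None
seen = 0

def process(x):
    """Score one incoming observation and retrain the model periodically."""
    global model, seen
    window.append(x)
    seen += 1
    if seen % n_update_freq == 0:
        model = KNN(n_neighbors=5, contamination=0.1)
        model.fit(np.array(window))                   # refit on the current window
    if model is None:
        return None                                   # no prediction before the first fit
    return int(model.predict(np.array([x]))[0])       # 0 = inlier, 1 = outlier

for x in np.random.rand(1000, 3):
    process(x)
```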

Local Outlier Factor (LOF)

Description

Local Outlier Factor (LOF) is a technique for Unsupervised Anomaly Detection which measures the local deviation of density of a given sample with respect to its neighbours. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighbourhood. More precisely, locality is given by k-nearest neighbours, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbours, one can identify samples that have a substantially lower density than their neighbours. These are considered outliers. The implemented library is a wrapper of the LOF’s PyOD library.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_neighbors (integer, optional, default=20): Number of neighbours to use by default for `kneighbors` queries. If n_neighbors is larger than the number of samples provided, all samples will be used

  • algorithm (string, optional, default=’auto’). Algorithm used to compute the nearest neighbours. Must be one of the following:

    • ‘ball_tree’ will use BallTree

    • ‘kd_tree’ will use KDTree

    • ‘brute’ will use a brute-force search

    • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method

  • leaf_size (integer, optional, default=30): Leaf size passed to `BallTree` or `KDTree`. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem

  • metric (string or callable, default ‘minkowski’): Metric used for the distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If ‘precomputed’, the training input X is expected to be a distance matrix. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics but is less efficient than passing the metric name as a string. Valid values for metric are (See the documentation for scipy.spatial.distance for details on these metrics: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html):

    • from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]

    • from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

  • p (integer, optional, default = 2): Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

  • metric_params (dictionary, optional, default = None): Additional keyword arguments for the metric function

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

LOF (online)

This is an extension of the Local Outlier Factor algorithm to work in online mode. To achieve this, a sliding-window technique is used, adding two parameters: the frequency (number of instances) at which the model is updated and the size of the window used for that update. The parameters of the training function of LOF online are the same as in Local Outlier Factor (LOF) plus two additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

Local Outlier Probability (LOP)

Description

Local Outlier Probability (LOP) is a local density-based outlier detection method which provides outlier scores in the range of [0,1] that are directly interpretable as the probability of a sample being an outlier. The outlier score measures the local deviation of density of a given sample with respect to its neighbours as Local Outlier Factor (LOF) but provides normalized outlier scores in the range [0,1]. These outlier scores are directly interpretable as a probability of an object being an outlier. Like LOF, it is local in that the anomaly score depends on how isolated the sample is with respect to the surrounding neighbourhood. Locality is given by k-nearest neighbours, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbours, one can identify samples that lie in regions of lower density compared to their neighbours and thus identify samples that may be outliers according to their Local Outlier Probability. The implemented library is a wrapper of the Local Outlier Probability’s PySAD library.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_initial_fit (integer, optional, default=100): Number of instances for an initial training. The model will be trained for the first time after this number of instances. Predictions will not be available during this initial period

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

  • n_neighbors (integer, optional, default=10): Number of neighbours to be used

  • extent (integer, optional, default=3): An integer value that controls the statistical extent, eg lambda times the standard deviation from the mean

The training function returns a list that contains error info if any error occurs, null otherwise.
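
The warm-up behaviour controlled by n_initial_fit can be sketched as follows, assuming PySAD’s per-instance interface (fit_partial / score_partial); the class name follows PySAD’s model catalogue, and the decision rule shown is purely illustrative:

```python
# Warm-up sketch for the streaming LoOP detector, assuming PySAD's
# fit_partial / score_partial interface (check against the installed version).
import numpy as np
from pysad.models import LocalOutlierProbability

n_initial_fit = 100
model = LocalOutlierProbability()   # n_neighbors / extent would be set here (see above)

for i, x in enumerate(np.random.rand(300, 3)):
    model.fit_partial(x)                 # update the model with one instance
    if i + 1 < n_initial_fit:
        continue                         # no predictions during the warm-up period
    score = model.score_partial(x)       # LoOP score in [0, 1]
    label = int(score > 0.5)             # illustrative decision rule, not the wrapper's
```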

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

One-Class Support Vector Machine (OCSVM)

Description

OCSVM is a technique for Unsupervised Anomaly Detection introduced by:

Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001

The OCSVM tries to separate the normal class from the abnormal class by projecting the observations, by means of a kernel function, into an N-dimensional space in which both classes are separable by a hyperplane. The implemented library is a wrapper of the OCSVM’s PyOD library.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • kernel (string, optional, default='rbf’): Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix

  • nu (float, optional): An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default, 0.5 will be taken

  • degree (integer, optional, default=3): Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels

  • gamma (float, optional, default='auto’): Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used instead

  • coef0 (float, optional, default=0.0): Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’

  • tol (float, optional): Tolerance for stopping criterion

  • shrinking (boolean, optional): Whether to use the shrinking heuristic

  • cache_size (float, optional): Specify the size of the kernel cache (in MB)

  • max_iter (integer, optional, default=-1): Hard limit on iterations within solver, or -1 for no limit

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.
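
For reference, configuring the wrapped PyOD OCSVM detector with the kernel-related parameters above could look like this (synthetic data, illustrative values):

```python
# Illustrative configuration of PyOD's OCSVM with the parameters described above.
import numpy as np
from pyod.models.ocsvm import OCSVM

X_train = np.random.rand(400, 4)

clf = OCSVM(
    kernel="rbf",        # project observations via an RBF kernel
    nu=0.1,              # upper bound on the fraction of training errors
    gamma="auto",        # 1 / n_features
    contamination=0.05,  # expected proportion of outliers in the training data
)
clf.fit(X_train)

labels = clf.labels_         # 0 = inlier, 1 = outlier on the training data
limit = clf.threshold_       # the "limit" reported by the prediction function
```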

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

OCSVM (online)

This is an extension of the One-Class Support Vector Machine algorithm to work in online mode. To achieve this, a sliding-window technique is used, adding two parameters: the frequency (number of instances) at which the model is updated and the size of the window used for that update. So, the parameters of the training function of OCSVM online are the same as in One-Class Support Vector Machine (OCSVM) plus two additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

Online Half-space trees

Description

This is a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalous data are rare. The model features an ensemble of random Half-Space Trees, and the tree structure is constructed without any data. This makes the method highly efficient because it requires no model restructuring when adapting to evolving data streams. The implemented library is a wrapper of the scikit-multiflow Half-Space Trees anomaly detector.

For more details, see:

S.C.Tan, K.M.Ting, and T.F.Liu, “Fast anomaly detection for streaming data,” in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, 2011, vol. 22, no. 1, pp. 1511–1516.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_estimators (integer, optional, default=25): Number of trees in the ensemble

  • window_size (integer, optional, default=250): The window size of the stream

  • depth (integer, optional, default=15): The maximum depth of the trees in the ensemble

  • size_limit (integer, optional, default=50): The minimum mass required in a node (as a fraction of the window size) to calculate the anomaly score. A good setting is 0.1 * window_size

  • anomaly_threshold (double, optional, default=0.5): The threshold for declaring anomalies. Any instance with a prediction probability above this threshold will be declared an anomaly

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.
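
The predict-then-update cycle described above can be sketched with scikit-multiflow’s Half-Space Trees detector as follows (parameter names follow the list above; the stream is synthetic):

```python
# Incremental use of scikit-multiflow's Half-Space Trees (illustrative sketch).
import numpy as np
from skmultiflow.anomaly_detection import HalfSpaceTrees

stream = np.random.rand(1000, 3)          # feature stream
truth = np.zeros(1000, dtype=int)         # true labels (0 = normal), when available

detector = HalfSpaceTrees(
    n_features=3,
    n_estimators=25,
    window_size=250,
    depth=15,
    size_limit=50,
    anomaly_threshold=0.5,
)

# Prime the detector with a short initial batch, then predict and update per instance
detector.partial_fit(stream[:50], truth[:50])
for x, y in zip(stream[50:], truth[50:]):
    X = x.reshape(1, -1)
    y_pred = detector.predict(X)          # 1 = anomaly, 0 = normal
    detector.partial_fit(X, [y])          # update once the true value is known
```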

xStream

Description

xStream operates on projections of the data points, which it maintains on-the-fly, seamlessly accommodating newly emerging features. These projections are lower dimensional, fixed size, and preserve the distances between the points well. xStream performs outlier detection via density estimation, through an ensemble of randomized partitions of the data. These partitions are constructed recursively, where the data is split into smaller and smaller bins, which allows finding outliers at different granularities. Temporal shifts are managed by a window-based approach where bin counts accumulated in the previous window are used to score points in the current window, with windows sliding forward periodically after each current window is full. The implemented library is a wrapper of the xStream’s PySAD library.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_initial_fit (integer, optional, default=100): Number of instances for an initial training. The model will be trained for the first time after this number of instances. Predictions will not be available during this initial period

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

  • num_components (integer, optional, default=100): The number of components for streamhash projection

  • n_chains (integer, optional, default=100): The number of half-space chains

  • depth (integer, optional, default=25): The maximum depth for the chains

  • window_size (integer, optional, default=25): The size (and the sliding length) of the reference window

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • “label”: Predicted values (0-inlier,1-outlier)

  • “score”: Anomaly score

  • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)
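
A streaming run with PySAD’s xStream, using the parameters listed above, can be sketched as follows; the quantile-based thresholding is an assumption for illustration, not the wrapper’s actual rule:

```python
# Streaming sketch with PySAD's xStream (constructor arguments mirror the list above).
import numpy as np
from pysad.models import xStream

model = xStream(num_components=100, n_chains=100, depth=25, window_size=25)

scores = []
for x in np.random.rand(500, 4):
    model.fit_partial(x)                  # update projections and bin counts
    scores.append(model.score_partial(x)) # outlier score for the current instance

scores = np.array(scores)
limit = np.quantile(scores, 0.9)          # e.g. contamination = 0.1
labels = (scores > limit).astype(int)     # 0 = inlier, 1 = outlier
```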

Auto model (batch learning)

Description

It implements a strategy for automatic model selection for the batch unsupervised scenario. Several outlier detection models are fitted at the same time, and their outputs are combined following a majority voting strategy using the results (0-inlier, 1-outlier) of each model.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • result: Contains the following items:

    • “label”: Predicted values (0-inlier,1-outlier)

    • “score”: Anomaly score

    • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

  • payload: Contains the following items:

    • “contributions”: Dictionary with contributions (one per variable) sorted from high to low values (in absolute value). These contributions correspond to those of PCA model. The algorithm computes SPE/UCL_SPE ratio and the T2/UCL_T2 ratio and the contributions of the statistic (SPE or T2) with highest ratio are taken
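
The majority-voting combination can be sketched as follows; the particular set of detectors shown is illustrative, since the actual ensemble composition is defined internally by the component:

```python
# Majority vote over the 0/1 labels of several fitted detectors (illustrative set).
import numpy as np
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM

X_train = np.random.rand(500, 3)
X_new = np.random.rand(20, 3)

detectors = [KNN(contamination=0.1), LOF(contamination=0.1), OCSVM(contamination=0.1)]
for d in detectors:
    d.fit(X_train)

votes = np.vstack([d.predict(X_new) for d in detectors])       # one row of 0/1 per model
label = (votes.sum(axis=0) > len(detectors) / 2).astype(int)   # majority vote per observation
```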

Auto model (online)

Description

It implements the majority vote strategy for automatic model selection for the online unsupervised scenario. Several outlier detection models are fitted at the same time, and the final result is obtained by a majority vote over the outputs (0-inlier, 1-outlier) of each model. The ensemble mixes online and adaptive learning algorithms: online algorithms are updated with each new instance, while adaptive algorithms are retrained after “n_update_freq” instances using the last “n_update_size” instances.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_update_freq (integer, optional, default=100): training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): the model will be trained with the last “n_update_size” instances

  • contamination (float in (0., 0.5), optional, default=0.1): The amount of contamination of the data set, ie the proportion of outliers in the data set. When fitting this is used to define the threshold on the decision function

The training function returns a list that contains error info if any error occurs, null otherwise.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • result: Contains the following items:

    • “label”: Predicted values (0-inlier,1-outlier)

    • “score”: Anomaly score

    • “limit”: Upper threshold. Scores above this limit are considered outliers (anomalies)

  • payload: Contains the following items:

    • “contributions”: Dictionary with contributions (one per variable) sorted from high to low values (in absolute value). These contributions correspond to those of PCA model. The algorithm computes SPE/UCL_SPE ratio and the T2/UCL_T2 ratio and the contributions of the statistic (SPE or T2) with highest ratio are taken

Classification

Partial Least Squares-Discriminant Analysis (PLS-DA)

Description

Discriminant Analysis using Partial Least Squares (PLS-DA) is a particular case of the PLS model where the response matrix Y is an artificial matrix that contains as many indicator (dummy) variables in columns as classes have been defined in a categorical variable. PLS-DA is therefore an adaptation of PLS to classification problems. A great advantage of PLS-DA is that it is able to detect abnormal situations through its two statistics (Squared Prediction Error and Hotelling’s T2). As with Principal Component Analysis and Partial Least Squares, PLS-DA is able to determine the root cause of an anomaly through contribution plots, which identify the variables responsible for the anomaly.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • response (string): The name of the response variable (categorical)

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional default=None): The list of categorical variables.

  • n_components (integer, optional, default=-1):

    • If -1 the components will be computed automatically

    • If >0 the model will be fit with n_components

  • n_comp_criteria (list or string, optional, default=”Q2”): Criteria for the selection of the number of components:

    • If it is a list, it specifies the minimum value of R2 that the model must achieve, that is, the model will have the minimum number of components that reach the specified value of R2. The list has two elements:

      • “R2X” or “R2Y”: Which R2 must be taken into account: X or Y

      • A number between [0,1] that specifies the value of R2(X) or R2(Y) that must be reached

    • If it is “Q2”, then the number of components will be computed automatically by cross-validation based on the Q2 trend

  • scale (boolean, optional default=True): Whether to scale the data or not. By default, data is standardized to unit variance and zero‑mean

  • cleansing (string, optional, default='none’): Must be one of the following:

    • ‘none’: No data cleansing is done at all

    • ‘extreme’: Removes observations with SPE (or T2) greater than 3*UCL

    • ‘full’: Removes observations with SPE (or T2) greater than 3*UCL and then removes the highest SPE (or T2) values until at most 5% of the observations are out of control

  • tol_variance_filter (float, optional, default=1e-05): Removes features with coefficient of variation less than tol_variance_filter. This avoids problems with constant features or features with low variance which are not useful for the model

When fitting the PLS model the SPE (Squared prediction error) and the Hotelling’s T² statistics are computed. The corresponding upper control limits (UCL) of both statistics are also computed. All this information is stored as properties in the PLS model object.

The training function returns the R2 computed for the response variable predictions.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • model (PLS object): A model previously trained

  • json_data (string): A JSON with data to predict

The results of the prediction function are a JSON with the following structure:

  • result: Contains the following fields:

    • status: One of ‘Under control’, ‘Anomaly (SPE)’, ‘Anomaly (T2)’ or ‘Anomaly (SPE+T2)’

    • prediction: The predicted value for the response variable

  • payload: Contains the following fields:

    • UCL_SPE: Upper Control Limit for Squared Prediction Error

    • SPE: Value of Squared Prediction Error statistic

    • SPE_contrib: Dictionary with SPE contributions (one per variable)

    • UCL_T2: Upper Control Limit for T2 Hotelling’s statistic

    • T2: Value of T2 Hotelling’s statistic

    • T2_contrib: Dictionary with T2 contributions (one per variable)

PLS-DA is able to detect abnormal situations and to identify which features are responsible for them. When an observation is an anomaly, it could be because the SPE statistic is above UCL_SPE and/or because T2 is above UCL_T2; the corresponding contributions (SPE_contrib and/or T2_contrib) should then be checked. As stated before, contributions are given as a dictionary where the key is the name of the variable and the value is the contribution of that variable. Variables with higher contributions should be investigated in order to find the root causes of the anomaly.
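
The statistics involved can be illustrated with a simplified scikit-learn sketch. This is not the component’s implementation: the component derives theoretical upper control limits, while empirical percentiles are used here, and the data and dummy-coded response are synthetic.

```python
# Simplified sketch of SPE and Hotelling's T2 on a PLS-DA style model (illustrative).
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
predictor_cols = ["x1", "x2", "x3", "x4"]
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=predictor_cols)
df["quality"] = rng.choice(["ok", "defect"], size=200)

X = StandardScaler().fit_transform(df[predictor_cols])      # zero mean, unit variance
Y = pd.get_dummies(df["quality"]).to_numpy(dtype=float)     # dummy-coded response matrix

pls = PLSRegression(n_components=2, scale=False).fit(X, Y)
T = pls.transform(X)                                        # latent scores
residuals = X - T @ pls.x_loadings_.T                       # part of X not explained by the model

spe = (residuals ** 2).sum(axis=1)                          # Squared Prediction Error
t2 = ((T / T.std(axis=0, ddof=1)) ** 2).sum(axis=1)         # Hotelling's T2

ucl_spe, ucl_t2 = np.quantile(spe, 0.95), np.quantile(t2, 0.95)          # empirical control limits
spe_contrib = dict(zip(predictor_cols, residuals[np.argmax(spe)] ** 2))  # per-variable contributions
```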

PLS-DA (online)

This is an extension of the Partial Least Squares-Discriminant Analysis (PLS-DA) algorithm to work in online mode. To achieve this, a sliding-window technique is used, adding two parameters: the frequency (number of instances) at which the model is updated and the size of the window used for that update. So, the parameters of the training function of PLS-DA online are the same as in Partial Least Squares-Discriminant Analysis (PLS-DA) plus two additional parameters:

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

Adaptive Random Forest classifier

Description

The three most important aspects of Adaptive Random Forest are:

  • Inducing diversity through re-sampling

  • Inducing diversity through randomly selecting subsets of features for node splits

  • Drift detectors per base tree, which cause selective resets in response to drifts

It also allows training background trees, which start training if a warning is detected and replace the active tree if the warning escalates to a drift. The implemented library is a wrapper of the scikit-multiflow Adaptive Random Forest classifier.

For more details, see:

Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabricio Enembreck, Bernhard Pfahringer, Geoff Holmes, Talel Abdessalem. Adaptive random forests for evolving data stream classification. In Machine Learning, DOI: 10.1007/s10994-017-5642-8, Springer, 2017.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): Number of trees in the ensemble

  • max_features (integer, float, string, or None, optional, default="auto”): Max number of attributes for each node split:

    • If integer, then consider max_features features at each split

    • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split

    • If “auto”, then max_features=sqrt(n_features)

    • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”)

    • If “log2”, then max_features=log2(n_features)

    • If None, then max_features=n_features

  • performance_metric (string, optional, default="acc”): Metric used to track trees performance within the ensemble. It should be one of the following:

    • ‘acc’ – Accuracy

    • ‘kappa’ – Kappa statistic

  • grace_period (integer, optional, default=50): Number of instances a leaf should observe between split attempts

  • split_criterion (string, optional, default='info_gain’): Split criterion to use:

    • ‘gini’ – Gini

    • ‘info_gain’ - Information Gain

  • split_confidence (float, optional, default=0.01): Allowed error in split decision, a value closer to 0 takes longer to decide

  • tie_threshold (float, optional, default=0.05): Threshold below which a split will be forced to break ties

  • binary_split (boolean, optional, default=False): If True, only allow binary splits

  • stop_mem_management (boolean, optional, default=False): If True, stop growing as soon as memory limit is hit

  • no_preprune (boolean, optional, default=False): If True, disable pre-pruning

  • leaf_prediction (string, optional, default='nba’): Prediction mechanism used at leaves:

    • ‘mc’ - Majority Class

    • ‘nb’ - Naive Bayes

    • ‘nba’ - Naive Bayes Adaptive

  • nb_threshold(integer, optional, default=0): Number of instances a leaf should observe before allowing Naive Bayes

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.
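
The incremental cycle described above can be sketched with scikit-multiflow’s Adaptive Random Forest classifier (parameter values are illustrative; the JSON assembly of the prediction output is omitted):

```python
# Predict-then-update loop with scikit-multiflow's Adaptive Random Forest (sketch).
import numpy as np
from skmultiflow.meta import AdaptiveRandomForestClassifier

classes = [0, 1, 2]                       # all response classes must be known up front
model = AdaptiveRandomForestClassifier(
    n_estimators=10,
    performance_metric="acc",
    grace_period=50,
    leaf_prediction="nba",
)

X_stream = np.random.rand(1000, 5)
y_stream = np.random.randint(0, 3, size=1000)

# Prime the model, then predict each new instance and update once its label arrives
model.partial_fit(X_stream[:20], y_stream[:20], classes=classes)
for x, y in zip(X_stream[20:], y_stream[20:]):
    X = x.reshape(1, -1)
    y_pred = model.predict(X)             # predicted class ("result")
    proba = model.predict_proba(X)        # per-class probabilities ("payload")
    model.partial_fit(X, [y])             # incremental update with the true label
```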

Online Boosting ensemble classifier

Description

Online Boosting ensemble classifier is the online version of the boosting ensemble method (AdaBoost). AdaBoost focuses more on difficult examples. The misclassified examples by the current classifier are given more weights in the training set of the following learner. In the online context, since there is no training dataset, but a stream of samples, the drawing of samples with replacement cannot be trivially executed. The strategy adopted by the Online Boosting algorithm is to simulate this task by training each arriving sample K times, which is drawn by the binomial distribution. This online ensemble learner method is improved by the addition of an ADWIN change detector. ADWIN stands for Adaptive Windowing. It works by keeping updated statistics of a variable sized window, so it can detect changes and perform cuts in its window to better adapt the learning algorithms. The implemented library is a wrapper of the scikit-multiflow Online Boosting ensemble classifier.

For more details, see:

B. Wang and J. Pineau, “Online Bagging and Boosting for Imbalanced Data Streams,” in IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3353-3366, 1 Dec. 2016. doi: 10.1109/TKDE.2016.2609424

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): The size of the ensemble, in other words, how many classifiers to train

  • drift_detection (boolean, optional, default=True): A drift detector (ADWIN) can be used by the method to track the performance of the classifiers and adapt when a drift is detected

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: As response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.

KNN classifier

Description

k-Nearest Neighbors classifier is a non-parametric classification method that keeps track of the last max_window_size training samples. The predicted class-label for a given query sample is obtained in two steps:

  • Find the closest n_neighbors to the query sample in the data window

  • Aggregate the class-labels of the n_neighbors to define the predicted class for the query sample

The implemented library is a wrapper of the scikit-multiflow KNN classifier. A modification has been developed in order to treat categorical variables.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_neighbors (integer, default=5): The number of nearest neighbors to search for.

  • max_window_size (integer, default=1000): The maximum size of the window storing the last observed samples

  • leaf_size (integer, default=30) scikit-learn.KDTree parameter: The maximum number of samples that can be stored in one leaf node, which determines from which point the algorithm will switch for a brute-force approach. The bigger this number the faster the tree construction time, but the slower the query time will be

  • metric (string or scikit-learn.DistanceMetric object) scikit-learn.KDTree parameter: The distance metric to use for the KDTree. Default=’euclidean’

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.
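
The reason why categorical classes must be specified manually for online learners can be illustrated with a small encoding sketch (this is not the component’s actual preprocessing code): the one-hot encoding must keep a fixed width even for classes that have not yet appeared in the stream.

```python
# Fixed-width one-hot encoding for online learning (illustrative sketch).
import pandas as pd

categoricals = {"machine_id": ["M1", "M2", "M3"], "shift": ["day", "night"]}

def encode(record: dict) -> pd.DataFrame:
    """One-hot encode a single incoming record using the full, fixed class lists."""
    df = pd.DataFrame([record])
    for col, classes in categoricals.items():
        df[col] = pd.Categorical(df[col], categories=classes)
    return pd.get_dummies(df, columns=list(categoricals))

# Both records produce the same columns, even though "M3" and "night" are unseen here
print(encode({"temperature": 71.2, "machine_id": "M1", "shift": "day"}).columns.tolist())
print(encode({"temperature": 69.8, "machine_id": "M2", "shift": "day"}).columns.tolist())
```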

KNN with Adaptive Windowing classifier

Description

K-Nearest Neighbors classifier with Adaptive Windowing (ADWIN) change detector is an improvement over the regular k-Nearest Neighbors classifier, as it is resistant to concept drift. It uses the ADWIN change detector to decide which samples to keep and which to forget, thereby regulating the sample window size. It uses the regular KNNClassifier as a base class, with the major differences that it keeps a variable-size window instead of a fixed-size one and that it updates the ADWIN algorithm as new data becomes available.

The implemented library is a wrapper of the scikit-multiflow KNN ADWIN classifier. A modification has been developed in order to treat categorical variables.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_neighbors (integer, default=5): The number of nearest neighbours to search for

  • max_window_size (integer, default=1000): The maximum size of the window storing the last viewed samples

  • leaf_size (integer, default=30): The maximum number of samples that can be stored in one leaf node, which determines from which point the algorithm will switch for a brute-force approach. The bigger this number the faster the tree construction time, but the slower the query time will be

  • metric (string or scikit-learn.DistanceMetric object) scikit-learn.KDTree parameter: The distance metric to use for the KDTree. Default=’euclidean’

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.

Leveraging Bagging ensemble classifier

Description

The complete description of this method can be found in:

A. Bifet, G. Holmes, and B. Pfahringer, “Leveraging Bagging for Evolving Data Streams,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2010, no. 1, pp. 135–150.

The bagging performance is leveraged by increasing the re-sampling and by using output detection codes. It uses a Poisson distribution to simulate the re-sampling process. To increase re-sampling, a higher value of the w parameter of the Poisson distribution is used, which is 6 by default. With this value the input space diversity is increased by attributing a different range of weights to the samples. This algorithm also uses output detection codes: each label is coded with an n-bit binary code and n classifiers are associated with it, one per bit. At each new sample analysed, each classifier is trained on its own bit. This allows, to some extent, the correction of errors. To deal with concept drift it uses the ADWIN algorithm, one instance for each classifier. Each time a concept drift is detected the worst classifier in the ensemble is reset, which is done by comparing the ADWIN window sizes.

The implemented library is a wrapper of the scikit-multiflow Leveraging Bagging ensemble classifier.
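
The core re-sampling idea can be sketched as follows: each base learner sees every incoming sample k ~ Poisson(w) times. This simplified sketch omits ADWIN drift handling and output detection codes, and uses a Gaussian Naive Bayes stand-in as the incremental base learner.

```python
# Leveraging-bagging re-sampling sketch: Poisson(w) weights per base learner.
import numpy as np
from sklearn.naive_bayes import GaussianNB   # stand-in incremental base learner

rng = np.random.default_rng(0)
w = 6                                        # Poisson parameter (default in the method)
ensemble = [GaussianNB() for _ in range(10)]
classes = np.array([0, 1])

for x, y in zip(rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)):
    X = x.reshape(1, -1)
    for learner in ensemble:
        k = rng.poisson(w)                   # how many times this learner sees the sample
        for _ in range(k):
            learner.partial_fit(X, [y], classes=classes)
```

At prediction time, the ensemble members would then vote on the class of each new instance.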

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, default=10): The size of the ensemble, in other words, how many classifiers to train

  • w (integer, default=6): The Poisson distribution’s parameter, which is used to simulate the re-sampling process

  • delta (float, default=0.002): The delta parameter for the ADWIN change detector

  • enable_code_matrix (boolean, default=False): If set, enables Leveraging Bagging MC using Random Output Codes

  • leverage_algorithm (string, default='leveraging_bag’): The bagging algorithm to use. Can be one of the following:

    • ‘leveraging_bag’ - Leveraging Bagging using ADWIN

    • ‘leveraging_bag_me’ - Assigns weight=1 to a sample if misclassified, otherwise weight=error/(1-error)

    • ‘leveraging_bag_half’ - Use resampling without replacement for half of the instances

    • ‘leveraging_bag_wt’ - Without taking out all instances

    • ‘leveraging_subag’ - Using resampling without replacement

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.

SMOTE Bagging classifier

Description

Online SMOTEBagging is the online version of the ensemble method SMOTEBagging described in:

B. Wang and J. Pineau, “Online Bagging and Boosting for Imbalanced Data Streams,” in IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3353-3366, 1 Dec. 2016. doi: 10.1109/TKDE.2016.2609424

This method works by re-sampling the negative class with a replacement rate of 100%, while N positive examples are generated for each base learner, among which a specific percentage are created by re-sampling and the rest by the synthetic minority oversampling technique (SMOTE). This technique is especially useful for imbalanced problems, in which one of the classes has a very low representation. This online ensemble learner is improved by the addition of an ADWIN change detector. ADWIN stands for Adaptive Windowing. It works by keeping updated statistics of a variable-sized window, so it can detect changes and perform cuts in its window to better adapt the learning algorithms.

The implemented library is a wrapper of the scikit-multiflow Online SMOTE Bagging ensemble classifier.
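
In batch form, the SMOTE step itself can be illustrated with imbalanced-learn: synthetic minority samples are generated by interpolating between a minority instance and its nearest minority neighbours. The streaming classifier applies the same idea per base learner; the sketch below is only a batch illustration.

```python
# Batch illustration of SMOTE oversampling with imbalanced-learn.
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)        # heavily imbalanced response

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y))        # Counter({0: 950, 1: 50})
print(Counter(y_res))    # Counter({0: 950, 1: 950})
```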

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): The size of the ensemble, in other words, how many classifiers to train

  • sampling_rate (integer, optional, default=1): The sampling rate of the positive instances

  • drift_detection (boolean, optional, default=True): A drift detector (ADWIN) can be used by the method to track the performance of the classifiers and adapt when a drift is detected

The training function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.

Online Under-Over-Bagging ensemble classifier

Description

Online UnderOverBagging is the online version of the ensemble method UnderOverBagging described in:

B. Wang and J. Pineau, “Online Bagging and Boosting for Imbalanced Data Streams,” in IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3353-3366, 1 Dec. 2016. doi: 10.1109/TKDE.2016.2609424

This technique is especially useful for imbalanced problems, in which one of the classes has a very low representation. In this case, UnderOverBagging uses the strategy of under-sampling the majority class and oversampling the minority class. In addition, the sampling rate can be also varied over the bagging iterations, which further boosts the diversity of the base learners. This online ensemble learner method is improved by the addition of an ADWIN change detector. ADWIN stands for Adaptive Windowing. It works by keeping updated statistics of a variable sized window, so it can detect changes and perform cuts in its window to better adapt the learning algorithms.

The implemented library is a wrapper of the scikit-multiflow Online Under-Over-Bagging ensemble classifier.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): The size of the ensemble, in other words, how many classifiers to train

  • sampling_rate (integer, optional, default=1): The sampling rate of the positive instances

  • drift_detection (boolean, optional, default=True): A drift detector (ADWIN) can be used by the method to track the performance of the classifiers and adapt when a drift is detected

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as response variable can be multiclass, metrics are computed with the average='macro’ parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method so after a prediction step and when the value of the response variable is available then the model must be updated by calling the training method using this new observation.

Online RUSBoost ensemble classifier

Description

Online RUSBoost ensemble classifier is the adaptation of the RUSBoost ensemble learner to data streams. RUSBoost follows the techniques of UnderOverBagging and SMOTEBagging by introducing sampling techniques as a postprocessing step before each iteration of the standard AdaBoost algorithm.

This online ensemble learner method is improved by the addition of an ADWIN change detector. ADWIN stands for Adaptive Windowing. It works by keeping updated statistics of a variable sized window, so it can detect changes and perform cuts in its window to better adapt the learning algorithms.

For more details, see:

B. Wang and J. Pineau, “Online Bagging and Boosting for Imbalanced Data Streams,” in IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3353-3366, 1 Dec. 2016. doi: 10.1109/TKDE.2016.2609424

The implemented library is a wrapper of the scikit-multiflow Online RUSBoost ensemble classifier.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): The size of the ensemble, in other words, how many classifiers to train

  • sampling_rate (integer, optional, default=3): The sampling rate of the positive instances

  • algorithm (integer, optional, default=1): The implementation of RUSBoost to use {1, 2, 3}

  • drift_detection (boolean, optional, default=True): A drift detector (ADWIN) can be used by the method to track the performance of the classifiers and adapt when a drift is detected

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Precision

    • Recall

    • F1-score

    • Confusion matrix

    • Note: as the response variable can be multiclass, metrics are computed with the average='macro' parameter option when using the corresponding functions of scikit-learn.metrics

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

This is an incremental learning method: after a prediction step, once the value of the response variable becomes available, the model must be updated by calling the training method with this new observation.

XGBoost classifier

Description

XGBoost stands for “Extreme Gradient Boosting”. In recent years, it has powered many of the winning entries in machine-learning competitions. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements Machine Learning algorithms under the Gradient Boosting framework and provides parallel tree boosting to solve many data science problems quickly and accurately. It has the following features:

  • Regularized Learning: Regularization term helps to smooth the final learnt weights to avoid over-fitting. The regularized objective will tend to select a model employing simple and predictive functions

  • Gradient Tree Boosting: The tree ensemble model cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner

  • Shrinkage and Column Subsampling: Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each tree and leaves space for future trees to improve the model
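
Purely as an illustration of these ideas (regularization, shrinkage, and column subsampling), the sketch below uses the plain xgboost Python API with arbitrary parameter values; it is not the configuration used by the Supervision component.

```python
# Illustrative use of the xgboost Python API; parameter values are arbitrary examples.
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,     # shrinkage factor (eta) applied after each boosting step
    colsample_bytree=0.8,  # column subsampling to reduce over-fitting
    reg_lambda=1.0,        # L2 regularization on leaf weights
    max_depth=4,
)
clf.fit(X_tr, y_tr)
print(clf.predict_proba(X_te)[:3])  # class probabilities for the first test rows
```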

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (categorical)

  • categoricals (list, optional, default=None): The list of categorical variables

  • b_auto_tuning (boolean, optional, default=True): Whether to perform automatic hyperparameter tuning or not. If False, default parameters will be used

  • b_optimization (boolean, optional, default=False): Indicates whether the model will be used for optimization (zDigitalTwin)

  • pca_model (PCA object, optional, default=None): A pretrained PCA model that will be used during optimization (zDigitalTwin)

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • Accuracy

    • Precision

    • Recall

    • F1-score

    • Note: as the response variable can be multiclass, metrics are computed with the average='macro' parameter option when using the corresponding functions of scikit-learn.metrics (a short example follows this list)

  • s_error (string): A list containing the information about the possible error
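
The snippet below is a small illustration of how such macro-averaged metrics can be computed with scikit-learn; the JSON keys are assumptions, not the component's exact output schema.

```python
# Macro-averaged metrics as mentioned in the note above (illustrative only).
import json
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = ["ok", "defect_a", "defect_b", "ok", "defect_a", "ok"]
y_pred = ["ok", "defect_a", "ok",       "ok", "defect_b", "ok"]

json_metrics = json.dumps({
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall":    recall_score(y_true, y_pred, average="macro", zero_division=0),
    "f1":        f1_score(y_true, y_pred, average="macro", zero_division=0),
    "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
})
print(json_metrics)
```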

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • result: Contains the following

    • “label”: Name of the response variable

    • value: The probability for the prediction

  • payload: a list of records (one per each category) with the following information:

    • name: The category name

    • value: The probability for the category

  • raw_data: The data used for the prediction

Auto model (online)

Description

Uses historical data to evaluate the performance of the candidate models; performance is computed on the last 30% of the data. There are three main types of models, depending on the training strategy:

  • Batch learning methods: These methods are trained with the first 70% of data and then the trained model is used for testing the last 30%. The model is never updated

  • Adaptive learning methods: In this case the model is trained after “n_update_freq” instances using the last “n_update_size” instances

  • Online learning methods: The model is updated with every new instance

All models are compared using the F1-score metric and the best model is selected. This model will be used for predicting future observations.
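
Conceptually, the selection step can be pictured as follows: each candidate, already trained or updated according to its strategy, is scored on the most recent 30% of the history and the model with the best macro F1-score is kept. The helper below is a sketch of that idea, not the component's implementation.

```python
# Conceptual sketch of the selection step: score candidates on the last 30 %
# of the history and keep the one with the best macro F1-score.
from sklearn.metrics import f1_score

def select_best_model(candidates, X_hist, y_hist):
    """candidates: dict of name -> fitted model exposing predict()."""
    split = int(len(X_hist) * 0.7)            # first 70 % train / last 30 % evaluate
    X_eval, y_eval = X_hist[split:], y_hist[split:]

    scores = {name: f1_score(y_eval, model.predict(X_eval), average="macro")
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```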

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • response (string): The name of the response variable (categorical)

  • response_classes (list): A list with the classes for the response variable

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

The training function returns two values:

  • A string in JSON format containing the metrics (precision, recall and F1-score) of the best model

  • A list that contains error info if any error occurs, null otherwise

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • If the model is PLS-DA then the JSON contains the following:

    • result: Contains the following fields:

      • status: Whether the observation is ‘Under control’, ‘Anomaly (SPE)’, ‘Anomaly (T2)’, ‘Anomaly (SPE+T2)’

      • prediction: The predicted value

    • payload: Contains the following fields:

      • UCL_SPE: Upper Control Limit for Squared Prediction Error

      • SPE: Value of Squared Prediction Error

      • SPE_contrib: Dictionary with SPE contributions (one per variable)

      • UCL_T2: Upper Control Limit for T2 Hotelling statistic

      • T2: Value of T2 Hotelling statistic

      • T2_contrib: Dictionary with T2 contributions (one per variable)

  • Otherwise, the JSON contains the following:

    • id: Data identifier (index)

    • timestamp: Time when the prediction is done

    • result: Contains the following

      • “label”: Name of the response variable

      • value: The probability for the prediction

    • payload: a list of records (one per each category) with the following information:

      • name: The category name

      • value: The probability for the category

    • raw_data: The data used for the prediction

Regression

Partial Least Squares (PLS)

Description

Partial Least Squares (PLS) is a Supervised Machine Learning technique for regression problems. PLS is a multivariate statistical technique that builds predictive models between two data matrices: X, containing the explanatory variables, and Y, containing the response variables to predict. Unlike linear regression, PLS can construct a single predictive model for a set of response variables at a time, and it can be used in the presence of missing data and collinearity among the variables of the matrix X, as well as when the number of observations is smaller than the number of variables. PLS tries to find the related sources of variability of X and Y. A great advantage of PLS is that it can detect abnormal situations through its two statistics (Squared Prediction Error and Hotelling’s T2). Like Principal Component Analysis and Partial Least Squares-Discriminant Analysis, PLS can point to the root cause of an anomaly through contribution plots that identify which variables are responsible for that anomaly.

The developed library provides the following features:

  • Implements the Modified kernel algorithm 2 as described in Improved PLS Algorithms

  • Automatic computation of the number of components; either:

    • The user can specify a fixed number of components

    • The class can compute the optimal number of components automatically by cross-validation

    • The class can compute the number of components that explain a given R2 value specified by the user

  • Data cleansing: Provides a function to obtain Normal Operating Conditions for a given dataset by removing outliers. Two possibilities have been implemented:

    • Removing the most extreme observations (those with statistic values greater than 3 times the upper control limits)

    • Performing a deeper cleansing by also removing the observations with the highest values of the statistics (SPE and Hotelling’s T2) until only 5% of the observations remain out of control (the statistically expected percentage)

  • Fault detection: Computes the SPE and Hotelling’s T² statistics and their corresponding Upper Control Limits to detect abnormal situations

  • Fault diagnosis: Computes contributions to SPE and Hotelling’s T² to identify which features are responsible for such abnormal situations

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • response (string): The name of the response variable (numerical)

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional default=None): The list of categorical variables

  • n_components (integer, optional, default=-1):

    • If -1 the components will be computed automatically

    • If >0 the model will be fit with n_components

  • n_comp_criteria (list or string, optional, default=”Q2”): Criteria for the selection of the number of components:

    • If it is a list, it specifies the minimum value of R2 that the model must achieve, that is, the model will have the minimum number of components that reach the specified R2. The list has two elements:

      • “R2X” or “R2Y”: Which R2 to take into account: X or Y

      • A number in [0,1] that specifies the value of R2(X) or R2(Y) that must be reached

    • If it is “Q2”, the number of components will be computed automatically by cross-validation based on the Q2 trend

  • scale (boolean, optional default=True): Whether to scale the data or not. By default, data is standardized to unit variance and zero‑mean

  • cleansing (string, optional, default='none’): Must be one of the following:

    • ‘none’: No data cleansing is done at all

    • ‘extreme’: Removes observations with SPE (or T2) greater than 3*UCL

    • ‘full’: Removes observations with SPE (or T2) greater than 3*UCL and then removes the highest SPE (or T2) values until only 5% of the observations remain out of control

  • tol_variance_filter (float, optional, default=1e-05): Removes features with coefficient of variation less than tol_variance_filter. This avoids problems with constant features or features with low variance which are not useful for the model

When fitting the PLS model the SPE (Squared prediction error) and the Hotelling’s T² statistics are computed. The corresponding upper control limits (UCL) of both statistics are also computed. All this information is stored as properties in the PLS model object.
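
The sketch below illustrates how these statistics can be derived from a fitted PLS model, using scikit-learn's PLSRegression on pre-standardized data; the empirical 95th-percentile control limits are only a simple stand-in for the limits computed by the component's library.

```python
# Illustrative computation of SPE and Hotelling's T2 for PLS monitoring.
# Uses sklearn's PLSRegression on standardized data; control limits are taken
# here as simple empirical 95th percentiles (the component may use other formulas).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

Xs = StandardScaler().fit_transform(X)
pls = PLSRegression(n_components=2, scale=False).fit(Xs, y)

T = pls.transform(Xs)                        # scores
E = Xs - T @ pls.x_loadings_.T               # X residuals
SPE = (E ** 2).sum(axis=1)                   # squared prediction error per observation
T2 = ((T / T.std(axis=0, ddof=1)) ** 2).sum(axis=1)   # Hotelling's T2 per observation

UCL_SPE = np.percentile(SPE, 95)             # simple empirical upper control limits
UCL_T2 = np.percentile(T2, 95)
print((SPE > UCL_SPE).sum(), "SPE alarms;", (T2 > UCL_T2).sum(), "T2 alarms")
```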

The training function returns the R2 computed for the response variable predictions.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • model (PLS object): A model previously trained

  • json_data (string): A JSON with data to predict

The results of the prediction function are a JSON with the following structure:

  • result: Contains the following fields:

    • status: Value with the possible values ‘Under control’, ‘Anomaly (SPE)’, ‘Anomaly (T2)’ or ‘Anomaly (SPE+T2)’

    • prediction: The predicted value for the response variable

  • payload: contains the following fields:

    • UCL_SPE: Upper Control Limit for Squared Prediction Error

    • SPE: Value of Squared Prediction Error statistic

    • SPE_contrib: Dictionary with SPE contributions (one per variable)

    • UCL_T2: Upper Control Limit for T2 Hotelling’s statistic

    • T2: Value of T2 Hotelling’s statistic

    • T2_contrib: Dictionary with T2 contributions (one per variable)

PLS is able not only to detect abnormal situations but also to identify which features are responsible for them. When an observation is an anomaly, it may be because the SPE statistic is above UCL_SPE and/or because T2 is above UCL_T2; the corresponding contributions (SPE_contrib and/or T2_contrib) should then be checked. As stated before, contributions are given as a dictionary where the key is the name of the variable and the value is the contribution of that variable. Variables with higher contributions should be investigated to find the root causes of the anomaly.
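
As an illustration of this diagnosis step, the snippet below parses a prediction JSON with the shape described above and ranks the contributions of the relevant statistic; the payload values are made up.

```python
# Illustrative root-cause ranking from the prediction JSON described above
# (the payload values here are made up).
import json

prediction = json.loads("""
{
  "result":  {"status": "Anomaly (SPE)", "prediction": 12.7},
  "payload": {"UCL_SPE": 9.2, "SPE": 14.8,
              "SPE_contrib": {"temperature": 8.1, "pressure": 4.2, "speed": 2.5},
              "UCL_T2": 11.0, "T2": 6.3,
              "T2_contrib": {"temperature": 2.0, "pressure": 1.1, "speed": 3.2}}
}
""")

if prediction["result"]["status"] != "Under control":
    # For an SPE+T2 anomaly both dictionaries should be inspected; this sketch picks one.
    key = "SPE_contrib" if "SPE" in prediction["result"]["status"] else "T2_contrib"
    ranked = sorted(prediction["payload"][key].items(), key=lambda kv: kv[1], reverse=True)
    print("Investigate first:", ranked)   # variables with the highest contributions
```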

PLS (online)

This is an extension of the Partial Least Squares algorithm to work in online mode. To achieve this, a sliding-window technique is used: two parameters control the frequency (number of instances) at which the model is updated and the size of the window used for that update. The parameters of the training function of the online PLS are therefore the same as in Partial Least Squares (PLS) plus the following two additional parameters (a minimal sketch of this update scheme is shown after the list):

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances
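
A minimal sketch of this sliding-window update scheme is shown below, with scikit-learn's PLSRegression standing in for the component's PLS library.

```python
# Minimal sketch of the sliding-window update scheme (PLSRegression as a stand-in).
from collections import deque
import numpy as np
from sklearn.cross_decomposition import PLSRegression

n_update_freq, n_update_size = 100, 500
window_X = deque(maxlen=n_update_size)   # keeps only the last n_update_size instances
window_y = deque(maxlen=n_update_size)
model, seen = None, 0

def on_new_observation(x, y):
    """Store the observation (x: 1-D feature vector) and refit every n_update_freq instances."""
    global model, seen
    window_X.append(x)
    window_y.append(y)
    seen += 1
    if seen % n_update_freq == 0:
        model = PLSRegression(n_components=2).fit(np.array(window_X), np.array(window_y))
```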

Online Hoeffding Adaptive Tree regressor

Description

A Hoeffding Tree or Very Fast Decision Tree (VFDT) is the standard decision tree algorithm for data stream classification. VFDT uses the Hoeffding bound to decide the minimum number of arriving instances needed to achieve a certain level of confidence in splitting a node. This confidence level determines how close the statistics of the attribute chosen by VFDT are to those of the attribute that a batch decision tree learner would choose. The Hoeffding Adaptive Tree regressor is an adaptation of this algorithm to regression problems. This version uses ADWIN to detect drift. ADWIN (ADaptive WINdowing) is an adaptive sliding-window algorithm for detecting change and keeping updated statistics about a data stream. ADWIN allows algorithms not designed for drifting data to be resistant to this phenomenon. The implemented library is a wrapper of the scikit-multiflow Online Hoeffding Adaptive Tree regressor.
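
For reference, the snippet below is a minimal sketch of the underlying scikit-multiflow regressor (assumed API, scikit-multiflow 0.5); the constructor arguments mirror the training parameters listed next.

```python
# Minimal sketch of the underlying scikit-multiflow regressor (assumed API).
from skmultiflow.data import RegressionGenerator
from skmultiflow.trees import HoeffdingAdaptiveTreeRegressor

stream = RegressionGenerator(n_samples=2000, n_features=6, random_state=1)
reg = HoeffdingAdaptiveTreeRegressor(
    grace_period=200,            # instances a leaf observes between split attempts
    split_confidence=1e-7,       # allowed error in the split decision (Hoeffding bound)
    leaf_prediction="perceptron",
)

X, y = stream.next_sample(500)
reg.partial_fit(X, y)            # incremental training
X_new, y_new = stream.next_sample(5)
print(reg.predict(X_new))
```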

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (numerical)

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • grace_period (integer, default=200): Number of instances a leaf should observe between split attempts

  • split_confidence (float, default=0.0000001): Allowed error in split decision, a value closer to 0 takes longer to decide

  • tie_threshold (float, default=0.05): Threshold below which a split will be forced to break ties

  • binary_split (boolean, default=False): If True, only allow binary splits

  • stop_mem_management (boolean, default=False): If True, stop growing as soon as memory limit is hit

  • remove_poor_atts (boolean, default=False): If True, disable poor attributes

  • no_preprune (boolean, default=False): If True, disable pre-pruning

  • leaf_prediction (string, default='perceptron’): Prediction mechanism used at leaves:

    • ‘mean’ - Target mean

    • ‘perceptron’ - Perceptron

  • learning_ratio_perceptron (float): The learning rate of the perceptron

  • learning_ratio_decay (float): Decay multiplier for the learning rate of the perceptron

  • learning_ratio_const (boolean): If False the learning ratio will decay with the number of examples seen

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • R2

    • Mean Squared Error

    • Root Mean Squared Error

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • payload: Contains the following information:

    • name: Name of the response variable

    • value: Predicted value for the response variable

  • raw_data: The data used for the prediction

This is an incremental learning method: after a prediction step, once the value of the response variable becomes available, the model must be updated by calling the training method with this new observation.

Online KNN regressor

Description

This is an adaptation of the KNN classifier to be used in regression problems. This works by storing a buffer with the `window_size` (algorithm parameter) most recent observations. A brute-force search is used to find the `n_neighbors` nearest observations in the buffer to make a prediction. The implemented library is a wrapper of the scikit-multiflow Online KNN regressor.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (numerical)

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_neighbors (integer, default=5): The number of nearest neighbours to search for

  • grace_period (integer, default=200): Number of initial instances to discard when computing metrics

  • max_window_size (integer, default=1000): The maximum size of the window storing the last observed samples

  • leaf_size (integer, default=30), scikit-learn KDTree parameter: The maximum number of samples that can be stored in one leaf node, which determines the point at which the algorithm switches to a brute-force approach. The bigger this number, the faster the tree construction, but the slower the queries

  • metric (string or scikit-learn DistanceMetric object, default=’euclidean’), scikit-learn KDTree parameter: The distance metric to use for the KDTree

  • aggregation_method (string, default='mean’): The method to aggregate the target values of neighbours: ‘mean’ or ‘median’

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • R2

    • Mean Squared Error

    • Root Mean Squared Error

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict.

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • payload: Contains the following information:

    • name: Name of the response variable

    • value: Predicted value for the response variable

  • raw_data: The data used for the prediction

This is an incremental learning method: after a prediction step, once the value of the response variable becomes available, the model must be updated by calling the training method with this new observation.

Adaptive Random Forest regressor

Description

This implementation is slightly different from the original algorithm proposed in:

Gomes, H.M., Barddal, J.P., Boiko, L.E., Bifet, A., 2018. Adaptive random forests for data stream regression. ESANN 2018.

The Adaptive Random Forest regressor (ARF) is an ensemble technique that uses the Hoeffding Tree Regressor as base learner. It also adds a new strategy to monitor the incoming data and check for concept drifts. The monitored data (either the trees’ errors or their predictions) are centred and scaled (z-score normalization) to have zero mean and unit standard deviation. Transformed values are then normalized again into the [0, 1] range to fulfil ADWIN’s requirements. It is assumed that the data subjected to the z-score normalization lie within the interval of the mean ± 3σ, as occurs in normal distributions.
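
That two-step normalization can be sketched as follows (assuming the mean ± 3σ range mentioned above; this only illustrates the transformation, not the library internals):

```python
# Sketch of the two-step normalization described above: z-score, then map the
# assumed mean +/- 3 sigma range onto [0, 1] so the values suit ADWIN.
import numpy as np

def normalize_for_adwin(values):
    z = (values - values.mean()) / (values.std() + 1e-12)  # zero mean, unit std
    return np.clip((z + 3.0) / 6.0, 0.0, 1.0)              # mean +/- 3 sigma -> [0, 1]

print(normalize_for_adwin(np.array([1.0, 2.0, 3.0, 10.0])))
```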

The implemented library is a wrapper of the scikit-multiflow Adaptive Random Forest regressor.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (numerical)

  • categoricals (dictionary): A dictionary with the categorical X variables and the classes specified manually (by the user) for each categorical. This is useful for online learning algorithms because all possible classes are not available from the very beginning

  • n_estimators (integer, optional, default=10): Number of trees in the ensemble

  • max_features (integer, float, str or None, optional, default="auto”): Max number of attributes for each node split:

    • If int, then consider max_features features at each split

    • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split

    • If “auto”, then max_features=sqrt(n_features)

    • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”)

    • If “log2”, then max_features=log2(n_features)

    • If None, then max_features=n_features

  • lambda_value (integer, optional, default=6): The lambda value for bagging (lambda=6 corresponds to Leverage Bagging)

  • aggregation_method (string, optional, default='median’): The method to use to aggregate predictions in the ensemble: ‘mean’ or ‘median’

  • weighted_vote_strategy (string or None, optional, default=None): Metric used to weight individual tree’s responses when aggregating them. Only used when aggregation_method='mean’. Possible values are:

    • None: Do not assign weights to individual tree’s predictions. Use the arithmetic mean instead

    • ‘mse’: Weight predictions using trees’ Mean Square Error

    • ‘mae’: Weight predictions using trees’ Mean Absolute Error

  • grace_period (integer, optional, default=50): Number of instances a leaf should observe between split attempts

  • split_confidence (float, optional, default=0.01): Allowed error in split decision, a value closer to 0 takes longer to decide

  • tie_threshold (float, optional, default=0.05): Threshold below which a split will be forced to break ties

  • binary_split (bool, optional, default=False): If True, only allow binary splits

  • stop_mem_management (bool, optional, default=False): If True, stop growing as soon as memory limit is hit

  • remove_poor_atts (bool, optional, default=False): If True, disable poor attributes

  • no_preprune (boolean, optional, default=False): If True, disable pre-pruning

  • leaf_prediction (string, optional, default='perceptron’): Prediction mechanism used at leaves:

    • ‘mean’ - Target mean

    • ‘perceptron’ - Perceptron

  • learning_ratio_perceptron (float, default=0.1): The learning rate of the perceptron

  • learning_ratio_decay (float, default=0.001): Decay multiplier for the learning rate of the perceptron

  • learning_ratio_const (bool, default=True): If False the learning ratio will decay with the number of examples seen

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • R2

    • Mean Squared Error

    • Root Mean Squared Error

  • s_error (string): A list containing the information about the possible error.

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • payload: Contains the following information:

    • name: Name of the response variable

    • value: Predicted value for the response variable

  • raw_data: The data used for the prediction

This is an incremental learning method: after a prediction step, once the value of the response variable becomes available, the model must be updated by calling the training method with this new observation.

XGBoost regressor

Description

XGBoost stands for “Extreme Gradient Boosting”. In recent years, it has powered many of the winning entries in machine-learning competitions. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements Machine Learning algorithms under the Gradient Boosting framework and provides parallel tree boosting to solve many data science problems quickly and accurately. It has the following features:

  • Regularized Learning: Regularization term helps to smooth the final learnt weights to avoid over-fitting. The regularized objective will tend to select a model employing simple and predictive functions

  • Gradient Tree Boosting: The tree ensemble model cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner

  • Shrinkage and Column Subsampling: Besides the regularized objective, two additional techniques are used to further prevent overfitting. The first technique is shrinkage introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each tree and leaves space for future trees to improve the model

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • response (string): The name of the response variable (numerical)

  • categoricals (list, optional, default=None): The list of categorical variables

  • b_auto_tuning (boolean, optional, default=True): Whether to perform automatic hyperparameter tuning or not. If False, default parameters will be used (see the illustrative sketch after this list)

  • b_optimization (boolean, optional, default=False): Indicates whether the model will be used for optimization (zDigitalTwin)

  • pca_model (PCA object, optional, default=None): A pretrained PCA model that will be used during optimization (zDigitalTwin)
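
The tuning strategy behind b_auto_tuning is not detailed here; purely as an illustration of automatic hyperparameter tuning, the sketch below runs a small randomized search over an XGBoost regressor.

```python
# Purely illustrative hyperparameter search over an XGBoost regressor; the
# component's own tuning strategy behind b_auto_tuning is not documented here.
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=400, n_features=8, noise=0.2, random_state=0)

search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [3, 4, 6],
        "learning_rate": [0.03, 0.1, 0.3],
        "colsample_bytree": [0.6, 0.8, 1.0],
    },
    n_iter=10, scoring="r2", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```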

The train function returns two parameters:

  • json_metrics (string): A JSON with the following metrics:

    • RMSE

    • MSE

    • R2

  • s_error (string): A list containing the information about the possible error

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The predict function returns a JSON with the following information:

  • id: Data identifier (index)

  • timestamp: Time when the prediction is done

  • payload: Contains the following information:

    • name: Name of the response variable

    • value: Predicted value for the response variable

  • raw_data: The data used for the prediction

Auto model (online)

Description

Uses historical data to evaluate the performance of the candidate models; performance is computed on the last 30% of the data. There are three main types of models, depending on the training strategy:

  • Batch learning methods: These methods are trained with the first 70% of data and then the trained model is used for testing the last 30%. The model is never updated

  • Adaptive learning methods: In this case the model is trained after “n_update_freq” instances using the last “n_update_size” instances

  • Online learning methods: The model is updated with every new instance

All models are compared using R-squared and the best model is selected. This model will be used for predicting future observations.

Training

The parameters of the training function are the following:

  • json_data (string): A JSON with data to train

  • s_model (string): The path where the model will be stored

  • response (string): The name of the response variable (numerical)

  • predictors (list, optional, default=None): The list of variables to consider for model training

  • categoricals (list, optional, default=None): The list of categorical variables

  • n_update_freq (integer, optional, default=100): Training frequency, that is, the model will be trained after “n_update_freq” instances

  • n_update_size (integer, optional, default=500): The model will be trained with the last “n_update_size” instances

The training function returns two values:

  • A string in JSON format containing the metrics (R-squared, MSE and RMSE) of the best model

  • A list that contains error info if any error occurs, null otherwise

Predict

Once the model is trained, predictions can be obtained using the “predict.py” module that has the following parameters:

  • json_data (string): A JSON with data to predict

  • s_model (string): The path where the model is stored

The results of the prediction function are a JSON with the following structure:

  • If the model is PLS then the JSON contains the following:

    • result: Contains the following fields:

      • status: Whether the observation is:

        • ‘Under control’

        • ‘Anomaly (SPE)’

        • ‘Anomaly (T2)’

        • ‘Anomaly (SPE+T2)’

      • prediction: The predicted value

    • payload: Contains the following fields:

      • UCL_SPE: Upper Control Limit for Squared Prediction Error

      • SPE: Value of Squared Prediction Error

      • SPE_contrib: Dictionary with SPE contributions (one per variable)

      • UCL_T2: Upper Control Limit for T2 Hotelling statistic

      • T2: Value of T2 Hotelling statistic

      • T2_contrib: Dictionary with T2 contributions (one per variable)

  • Otherwise, the JSON contains the following:

    • id: Data identifier (index)

    • timestamp: Time when the prediction is done

    • payload: Contains the following information:

      • name: Name of the response variable

      • value: Predicted value for the response variable

    • raw_data: The data used for the prediction
