Skip to content

Examples

Iris Data Example

We'll be using the well-known Iris dataset to show LoOP's capabilities. You'll need:

  • matplotlib 2.0.0 or greater
  • PyDataset 0.2.0 or greater
  • scikit-learn 0.18.1 or greater

First, let's import the packages and libraries we will need.

from PyNomaly import loop
import pandas as pd
from pydataset import data
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Create two sets of Iris data for scoring; one with clustering and the other without.

iris = pd.DataFrame(data('iris').drop(columns=['Species']))

Cluster the data using DBSCAN and generate two sets of scores. In both cases, we will use the default values for both extent (3) and n_neighbors (10).

db = DBSCAN(eps=0.9, min_samples=10).fit(iris)
m = loop.LocalOutlierProbability(iris).fit()
scores_noclust = m.local_outlier_probabilities
m_clust = loop.LocalOutlierProbability(iris, cluster_labels=list(db.labels_)).fit()
scores_clust = m_clust.local_outlier_probabilities

Organize the data into two separate Pandas DataFrames.

iris_clust = pd.DataFrame(iris.copy())
iris_clust['scores'] = scores_clust
iris_clust['labels'] = db.labels_
iris['scores'] = scores_noclust

Visualize the scores provided by LoOP in both cases (with and without clustering).

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'],
c=iris['scores'], cmap='seismic', s=50)
ax.set_xlabel('Sepal.Width')
ax.set_ylabel('Petal.Width')
ax.set_zlabel('Sepal.Length')
plt.show()

LoOP Scores without Clustering

LoOP Scores without Clustering

LoOP Scores with Clustering

LoOP Scores with Clustering

DBSCAN Cluster Assignments

DBSCAN Cluster Assignments

Note the differences between using LocalOutlierProbability with and without clustering. In the example without clustering, samples are scored according to the distribution of the entire data set. In the example with clustering, each sample is scored according to the distribution of each cluster. Which approach is suitable depends on the use case.

Note

Data was not normalized in this example, but it's probably a good idea to do so in practice.

Distance Metric Comparison

The example in examples/iris_dist_grid.py demonstrates scoring with several different distance metrics by providing custom distance and neighbor matrices:

LoOP Scores by Distance Metric

LoOP Scores by Distance Metric

Streaming Data

The example in examples/stream.py demonstrates the streaming approach using the Iris dataset. See the User Guide for a complete walkthrough.

LoOP Scores using Stream Approach with n=10

LoOP Scores using Stream Approach

Additional Examples

The following example scripts are available in the examples/ directory of the repository:

Script Description
iris.py Iris dataset with and without clustering
iris_dist_grid.py Comparing multiple distance metrics
stream.py Streaming data approach
numba_speed_diff.py Numba vs. pure Python speed comparison
parallel_benchmark.py Parallel processing benchmarks with n_jobs
multiple_gaussian_2d.py 2D Gaussian mixture data
1d_time_series.py 1-dimensional time series anomaly detection
cluster_labels_flipped.py Flipped cluster labels consistency check