Getting Started
Dependencies
- Python 3.8 - 3.14
- numpy >= 1.16.3
- python-utils >= 2.3.0
- (optional) numba >= 0.45.1
- (optional) scipy >= 1.3.0
Numba just-in-time (JIT) compiles the function that calculates the Euclidean distance between observations, which reduces computation time, most noticeably when a large number of observations are scored. Numba is optional; PyNomaly can be used with numpy alone if desired.
When scipy is available, PyNomaly uses its optimized distance computation (scipy.spatial.distance.cdist) and error function (scipy.special.erf) implementations for additional performance gains.
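To see which of the optional accelerators are importable in the current environment, a small check like the one below can help. This is a minimal sketch using only the standard library; the helper name `optional_backends` is hypothetical and this is not necessarily how PyNomaly itself detects numba or scipy.

```python
from importlib.util import find_spec


def optional_backends():
    """Report which optional packages can be imported in this environment.

    Hypothetical helper for illustration; not part of PyNomaly.
    """
    # find_spec returns None when a top-level package cannot be located.
    return {name: find_spec(name) is not None for name in ("numba", "scipy")}


# Prints a dict with one True/False entry per optional package.
print(optional_backends())
```

If either entry is False, the library should still work with numpy alone, only without the corresponding speed-up.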
Installation
Install from the Python Package Index:
pip install PyNomaly
Or from conda-forge:
conda install -c conda-forge pynomaly
Quick Start
from PyNomaly import loop
m = loop.LocalOutlierProbability(data).fit()
scores = m.local_outlier_probabilities
print(scores)
Here, data is an N x M (N rows, M columns) two-dimensional data set, supplied as either a pandas DataFrame or a NumPy array.
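For trying the snippet above without a real data set, a synthetic two-dimensional array is enough. The sketch below builds one with NumPy; the sizes, seed, and distribution parameters are arbitrary choices for illustration, not values taken from PyNomaly.

```python
import numpy as np

# Fixed seed so the example is reproducible.
rng = np.random.default_rng(seed=0)

# 200 observations clustered around the origin, 3 features each.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

# 5 observations placed far from the main cluster.
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 3))

# Stack into a single N x M array: one row per observation.
data = np.vstack([inliers, outliers])
print(data.shape)  # (205, 3)
```

The resulting array can be passed as data to LocalOutlierProbability.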
LocalOutlierProbability accepts two optional parameters: extent (an integer value of 1, 2, or 3; default 3) and n_neighbors (an integer greater than 0; default 10). Both can be set explicitly:
from PyNomaly import loop
m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20).fit()
scores = m.local_outlier_probabilities
print(scores)
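One way to build intuition for extent is to read it as a number of standard deviations. The sketch below assumes that extent plays the role of the multiplier on the standard deviation described in the LoOP paper; that reading is an assumption here, and the helper name `extent_coverage` is hypothetical. It computes the share of a normal distribution that lies within that many standard deviations of the mean.

```python
import math


def extent_coverage(extent):
    """Fraction of a normal distribution within `extent` standard deviations.

    Hypothetical helper for illustration; not part of PyNomaly.
    """
    if extent not in (1, 2, 3):
        raise ValueError("extent must be 1, 2, or 3")
    # For a normal distribution, P(|X - mu| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(extent / math.sqrt(2))


for extent in (1, 2, 3):
    print(extent, round(extent_coverage(extent), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Under that assumption, a larger extent treats more of the distribution as normal variation, so fewer observations receive high outlier probabilities.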
Using Cluster Labels
This implementation of LoOP includes an optional cluster_labels parameter. This is useful in cases where regions of varying density occur within the same set of data. When using cluster_labels, the Local Outlier Probability of a sample is calculated with respect to its cluster assignment.
from PyNomaly import loop
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=50).fit(data)
m = loop.LocalOutlierProbability(
    data, extent=2, n_neighbors=20, cluster_labels=list(db.labels_)
).fit()
scores = m.local_outlier_probabilities
print(scores)
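Because scores are computed with respect to each sample's cluster, it can be worth inspecting the cluster sizes before fitting. The sketch below counts members per label; `cluster_sizes` is a hypothetical helper and the label list is made-up example data. Note that DBSCAN assigns the label -1 to points it considers noise.

```python
import numpy as np


def cluster_sizes(labels):
    """Count how many samples carry each cluster label.

    Hypothetical helper for illustration; not part of PyNomaly.
    """
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}


# Made-up labels: two clusters (0 and 1) plus one noise point (-1).
labels = [0, 0, 0, 1, 1, -1]
print(cluster_sizes(labels))  # {-1: 1, 0: 3, 1: 2}
```

A cluster with fewer members than n_neighbors cannot supply that many neighbors from within the cluster, so comparing the smallest count against n_neighbors is a reasonable sanity check.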
Note
Unless all features are on the same scale, consider normalizing the data with z-scores or another normalization scheme before using LoOP, especially when working with multiple dimensions of varying scale. Missing values must also be handled before using LoOP, because it does not support pandas DataFrames or NumPy arrays that contain missing values.
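The preprocessing described in this note can be sketched with NumPy alone. The example below drops rows that contain missing values and then z-scores each column; `prepare` is a hypothetical helper, and dropping rows is only one possible strategy (imputation is another), so treat it as a starting point rather than a recommendation.

```python
import numpy as np


def prepare(data):
    """Drop rows with missing values, then z-score each column.

    Hypothetical helper for illustration; not part of PyNomaly.
    """
    arr = np.asarray(data, dtype=float)
    # Keep only rows in which no entry is NaN.
    arr = arr[~np.isnan(arr).any(axis=1)]
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    # Constant columns have zero spread; divide by 1 so they become zeros.
    std[std == 0] = 1.0
    return (arr - mean) / std


# Two features on very different scales, one row with a missing value.
raw = np.array([[1.0, 100.0], [2.0, 200.0], [np.nan, 300.0], [3.0, 300.0]])
clean = prepare(raw)
print(clean.shape)  # (3, 2)
```

After this step every column has mean 0 and unit standard deviation, so no single feature dominates the distance calculation.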