Semi-Supervised Outlier Detection Using PyOD
With Porosity, Permeability, and Density data as case examples
Outlier Detection
Outlier detection is a crucial step in many data science problems across fields. Many outlier detection methods have been developed and applied in practice, ranging from univariate descriptive statistics to machine learning/deep learning methods for multivariate outlier detection.
In this article, we will focus on utilizing various machine learning methods to perform outlier detection in multivariate data. PyOD is the main library used in this article because of its ease in applying various methods according to their respective characteristics.
This article will also introduce the semi-supervised approach to anomaly detection, including its application to porosity, permeability, and density data.
Semi-Supervised Machine Learning
Surprisingly, semi-supervised machine learning is not very popular in the data science community. In general, a semi-supervised detector is trained on data describing normal behavior and then used to predict outliers in the testing dataset. The main idea of semi-supervised anomaly detection is:
- Training data consists only of observations describing normal behavior.
- The model is fit on training data and then used to evaluate new observations.
- This approach is taken when outliers are defined as points differing from the distribution of the training data.
- Any new observations differing from the training data beyond a threshold are considered outliers, even if they form a high-density region.
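The idea can be sketched with a toy distance-to-centroid rule (synthetic data; the threshold rule below is a deliberate simplification for illustration, not what PyOD does internally):

```python
import numpy as np

rng = np.random.default_rng(42)

# Training data: only "normal" observations (a tight cluster).
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Threshold learned from the training data alone, e.g. the 99th
# percentile of each training point's distance to the centroid.
center = X_train.mean(axis=0)
train_dist = np.linalg.norm(X_train - center, axis=1)
threshold = np.percentile(train_dist, 99)

def predict(X_new):
    """1 = outlier, 0 = normal, relative to the training distribution."""
    dist = np.linalg.norm(X_new - center, axis=1)
    return (dist > threshold).astype(int)

X_test = np.array([[0.1, -0.2],   # close to the training cloud
                   [8.0, 8.0]])   # far from anything seen in training
print(predict(X_test))
```

Note that the model never sees an outlier during training; anything sufficiently unlike the training distribution is flagged.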
PyOD
PyOD is one of the most scalable and comprehensive libraries for anomaly detection on multivariate data. It offers more than 35 models, which can generally be grouped into 6 types:
- Proximity-Based, example: K-nearest Neighbors.
- Probabilistic, example: Kernel Density Estimation.
- Outlier Ensembles, example: Isolation Forest.
- Linear Model, example: Principal Component Analysis.
- Graph-based, example: Rgraph.
- Neural Network, example: Variational AutoEncoder.
We will compare 4 of the 6 groups above by applying one semi-supervised model from each group to porosity, permeability, and density data from the Volve dataset (graph-based and neural network methods are excluded due to their longer running times).
Practical Application
We divide the dataframe into 2 parts: df_clean (dataframe with no outliers, used as training data) and df_mix (dataframe with outliers, used as testing data).
# import standard library
import pandas as pd
import plotly.express as px
# import porosity, permeability, density data
df = pd.read_csv('volve_pordenperm.csv')
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)  # drop=True avoids keeping the old index as a column
# split data (rows 0-79 for training, row 80 onward for testing,
# with no overlap; .copy() avoids SettingWithCopyWarning later)
df_clean = df.loc[:79].copy()
df_mix = df.loc[80:].copy()
We compare the crossplots of porosity, permeability, and density on df_clean and df_mix using plotly.express.
# df_clean crossplot
fig = px.scatter(df_clean, x="porosity", y="permeability", color="density")
fig.show()
fig = px.scatter(df_clean, x="porosity", y="density", color="permeability")
fig.show()
fig = px.scatter(df_clean, x="density", y="permeability", color="porosity")
fig.show()
# df_mix crossplot
fig = px.scatter(df_mix, x="porosity", y="permeability", color="density")
fig.show()
fig = px.scatter(df_mix, x="porosity", y="density", color="permeability")
fig.show()
fig = px.scatter(df_mix, x="density", y="permeability", color="porosity")
fig.show()
Figure 1 shows that the two dataframes have a similar data distribution, with no values lying too far from the data midpoint. We will now analyze how the outlier classification of the df_mix data differs across 4 semi-supervised machine learning methods.
from pyod.models.knn import KNN # Proximity-Based
from pyod.models.kde import KDE # Probabilistic
from pyod.models.pca import PCA # Linear Model
from pyod.models.iforest import IForest # Outlier Ensembles
X_train = df_clean[['porosity','permeability','density']]
X_test = df_mix[['porosity','permeability','density']]
kNN (k-Nearest Neighbors)
kNN identifies outliers based on the n nearest neighbors and uses a weight computed from the distances to those neighbors as the outlier score. The weight can be calculated as the largest, mean, or median distance; the default weighting method is 'largest'.
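The three weighting schemes can be illustrated with a small NumPy sketch (synthetic data; this mimics, rather than calls, the computation behind PyOD's KNN `method` parameter):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
x = np.array([3.0, 3.0])   # query point to score
k = 5

# Distances from the query point to every training point; keep the k smallest.
d = np.sort(np.linalg.norm(X_train - x, axis=1))[:k]

# The three weighting schemes PyOD's KNN exposes via its `method` parameter:
score_largest = d.max()      # 'largest' (default): distance to the k-th neighbor
score_mean = d.mean()        # 'mean': average distance to the k neighbors
score_median = np.median(d)  # 'median': median distance to the k neighbors
```

'largest' reacts most strongly to a single distant neighbor, while 'mean' and 'median' smooth the score over all k neighbors.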
clf = KNN()
clf.fit(X_train)
y_pred_knn = clf.predict(X_test)
df_mix['KNN'] = y_pred_knn
df_mix["KNN"] = df_mix["KNN"].astype(str)
fig = px.scatter(df_mix, x="porosity", y="permeability", color="KNN",
hover_data=['density']
)
fig.show()
fig = px.scatter(df_mix, x="porosity", y="density", color="KNN",
hover_data=['permeability']
)
fig.show()
fig = px.scatter(df_mix, x="density", y="permeability", color="KNN",
hover_data=['porosity']
)
fig.show()
Figure 2 shows that the kNN outlier identification is biased toward the permeability data. The 1 (outlier) and 0 (non-outlier) values still cannot be differentiated in the density and porosity crossplots. This happens because permeability spans a far larger range than porosity and density (maximum permeability value > 250), so the distances, and hence the kNN outlier scores, are dominated by permeability alone.
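One way to reduce this scale bias (not applied in this article's workflow) is to standardize each feature using statistics from the training data only. A sketch with made-up values mimicking the scales of the three variables:

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both sets with statistics from the training data only,
    so no information leaks from the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Toy data where one feature (like permeability here) dwarfs the others.
rng = np.random.default_rng(7)
X_train = np.column_stack([rng.normal(0.2, 0.05, 100),   # porosity-like scale
                           rng.normal(250, 80, 100),     # permeability-like scale
                           rng.normal(2.3, 0.1, 100)])   # density-like scale
X_test = X_train[:10] + 0.01

Xtr_s, Xte_s = standardize(X_train, X_test)
# After scaling, each training feature has mean 0 and standard deviation 1,
# so no single variable dominates the distance calculation.
```

After this step, each feature contributes comparably to the kNN distances.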
KDE (Kernel Density Estimation)
Moving on to the second method: KDE detects outliers based on a density calculation at each point, comparing the local density of each point to the local density of its neighbors. The bandwidth is an important hyperparameter in KDE; it specifies how much weight distant points receive in the density calculation. The larger the bandwidth, the more influence neighbors that are further away have.
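The effect of the bandwidth can be illustrated with a hand-rolled Gaussian kernel density in NumPy (a simplification of what the KDE detector computes; the data and bandwidth values are made up):

```python
import numpy as np

def kde_density(x, X_train, h):
    """Average Gaussian kernel value at x; h is the bandwidth."""
    sq = np.sum((X_train - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h ** 2)))

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 2))
inlier, outlier = np.zeros(2), np.array([6.0, 6.0])

# With a small bandwidth the far point gets near-zero density (clear outlier);
# with a very large bandwidth the two become hard to distinguish.
for h in (0.5, 10.0):
    ratio = kde_density(outlier, X_train, h) / kde_density(inlier, X_train, h)
    print(f"h={h}: outlier/inlier density ratio = {ratio:.3f}")
```

A density-based detector flags points whose density falls below some threshold, so the choice of bandwidth directly controls how many points are flagged.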
clf = KDE()
clf.fit(X_train)
y_pred_kde = clf.predict(X_test)
df_mix['KDE'] = y_pred_kde
df_mix["KDE"] = df_mix["KDE"].astype(str)
fig = px.scatter(df_mix, x="porosity", y="permeability", color="KDE",
hover_data=['density']
)
fig.show()
fig = px.scatter(df_mix, x="porosity", y="density", color="KDE",
hover_data=['permeability']
)
fig.show()
fig = px.scatter(df_mix, x="density", y="permeability", color="KDE",
hover_data=['porosity']
)
fig.show()
Figure 3 shows that the KDE method overestimates the outliers: the red dots (outliers) far outnumber the blue dots (non-outliers). This overestimation can be caused by the bandwidth hyperparameter. Figure 4 shows the result when we choose a larger bandwidth.
PCA (Principal Component Analysis)
PCA is a linear dimensionality reduction technique that decomposes data into eigenvectors and eigenvalues. The eigenvectors with high eigenvalues capture most of the variance of the data. Points that deviate strongly along eigenvectors with small eigenvalues can be considered outliers, since the hyperplane constructed from those eigenvectors differs from the one spanned by normal data points.
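The scoring idea can be sketched in NumPy: project a point onto the eigenvectors of the training covariance and weight each squared projection by the inverse eigenvalue, so deviation along low-variance directions dominates the score (synthetic correlated data; a simplification of PyOD's PCA detector):

```python
import numpy as np

rng = np.random.default_rng(5)
# Correlated "normal" data: most variance lies along one direction.
z = rng.normal(size=200)
X_train = np.column_stack([z, z + 0.1 * rng.normal(size=200)])

mu = X_train.mean(axis=0)
cov = np.cov((X_train - mu).T)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

def pca_score(x):
    """Sum of squared projections weighted by inverse eigenvalue:
    deviation along low-variance eigenvectors is penalized most."""
    proj = (x - mu) @ eigvecs
    return np.sum(proj ** 2 / eigvals)

on_axis = pca_score(np.array([2.0, 2.0]))    # follows the main trend
off_axis = pca_score(np.array([2.0, -2.0]))  # violates the correlation
```

A point that breaks the learned correlation structure scores far higher than one that merely sits far along the main trend.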
clf = PCA()
clf.fit(X_train)
y_pred_pca = clf.predict(X_test)
df_mix['PCA'] = y_pred_pca
df_mix["PCA"] = df_mix["PCA"].astype(str)
fig = px.scatter(df_mix, x="porosity", y="permeability", color="PCA",
hover_data=['density']
)
fig.show()
fig = px.scatter(df_mix, x="porosity", y="density", color="PCA",
hover_data=['permeability']
)
fig.show()
fig = px.scatter(df_mix, x="density", y="permeability", color="PCA",
hover_data=['porosity']
)
fig.show()
Unlike the KNN and KDE results, PCA tends to produce a clearer outlier separation across all three variables (porosity, permeability, and density). The eigenvectors accommodate all variables evenly in the linear space; outliers are determined there, and the data is then transformed back to the original domain. In addition to the default standardization in PyOD's PCA model, PCA determines outliers from the small variance captured by the minor eigenvectors across all variables, which reduces the bias caused by the large distances in one particular variable.
Isolation Forest
Isolation Forest is a tree-based method for determining outliers. The algorithm consists of two steps:
- Builds isolation trees using a recursive partitioning algorithm on sub-samples of the training set until each point is isolated
- Computes anomaly score on new points
Outliers tend to have a much shorter average path length in the isolation trees (a high anomaly score) than normal data.
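A minimal sketch of this semi-supervised usage with scikit-learn's IsolationForest, which PyOD's IForest wraps (the data values here are synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 3))       # normal observations only
X_test = np.vstack([np.zeros((1, 3)),     # deep inside the training cloud
                    [[8.0, 8.0, 8.0]]])   # isolated far point

clf = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples: higher = more normal. The isolated point, which needs
# only a few random splits to cut off, gets the lower (more anomalous) score.
scores = clf.score_samples(X_test)
```

Because the splits are made on randomly chosen feature thresholds rather than raw distances, a single wide-ranging feature does not dominate the result.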
clf = IForest()
clf.fit(X_train)
y_pred_iforest = clf.predict(X_test)
df_mix['IForest'] = y_pred_iforest
df_mix["IForest"] = df_mix["IForest"].astype(str)
fig = px.scatter(df_mix, x="porosity", y="permeability", color="IForest",
hover_data=['density']
)
fig.show()
fig = px.scatter(df_mix, x="porosity", y="density", color="IForest",
hover_data=['permeability']
)
fig.show()
fig = px.scatter(df_mix, x="density", y="permeability", color="IForest",
hover_data=['porosity']
)
fig.show()
Similar to PCA, Isolation Forest produces a clear separation on the density data (Figure 6). Like other tree-based machine learning methods, Isolation Forest is less dependent on raw distances between variables: outliers are determined by the tree-splitting logic and tree depth, which are far less influenced by the very large range of one particular variable.
Final Words
Even though the PCA and Isolation Forest results tend to be similar, it is clear that each method has its own way of identifying and separating outliers. We cannot declare one method better or worse, since each works differently. What we need to do is understand how each of these methods works and match it to a suitable problem.
References
Angiulli, F., & Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. Lecture Notes in Computer Science, 15-27. doi:10.1007/3-540-45681-3_2.
Latecki, L. J., Lazarevic, A., & Pokrajac, D. (n.d.). Outlier Detection with Kernel Density Functions. Lecture Notes in Computer Science, 61-75. doi:10.1007/978-3-540-73499-4_6.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining. doi:10.1109/icdm.2008.17.
Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., & Chang, L. (n.d.). Principal Component-based Anomaly Detection Scheme. Studies in Computational Intelligence, 311-329. doi:10.1007/11539827_18.