Semi-Supervised Outlier Detection Using PyOD

With Porosity, Permeability, and Density data as case examples

Dekha
Artificial Intelligence in Plain English


Outlier Detection

Outlier detection is a crucial step in many data science problems across fields. Many outlier detection methods have been developed and applied in practice, ranging from univariate descriptive statistics to machine learning and deep learning approaches for multivariate outlier detection.

In this article, we will focus on using various machine learning methods to perform outlier detection on multivariate data. PyOD is the main library used here because it makes it easy to apply many different methods and compare their characteristics.

This article will also introduce the semi-supervised approach to anomaly detection, including its application to porosity, permeability, and density data.

Semi-Supervised Machine Learning

Surprisingly, semi-supervised machine learning is not very popular in the data science community. In general, a semi-supervised detector is trained on data that describes normal behavior and is then used to predict outliers in the testing dataset. The main idea of semi-supervised anomaly detection is as follows:

  • Training data consists only of observations describing normal behavior.
  • The model is fit on training data and then used to evaluate new observations.
  • This approach is taken when outliers are defined as points differing from the distribution of the training data.
  • Any new observations that differ from the training data by more than a threshold, even if they form a high-density region, are considered outliers (see the sketch below).
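In PyOD, this idea maps to a small, uniform detector API: fit on the clean data, then score or label the new observations. Below is a minimal sketch of that pattern; the detector choice and the X_clean / X_new arrays are placeholders for any training and testing data.

# minimal sketch of the fit-on-clean, score-on-new pattern (detector choice is illustrative)
from pyod.models.knn import KNN

clf = KNN()                            # any PyOD detector exposes the same interface
clf.fit(X_clean)                       # training data: normal observations only
scores = clf.decision_function(X_new)  # raw outlier scores for the new observations
labels = clf.predict(X_new)            # 0 = inlier, 1 = outlier
print(clf.threshold_)                  # score threshold learned from the training data

predict() compares each raw score against the threshold learned during fitting, which is controlled by the detector's contamination parameter.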

PyOD

PyOD is one of the most scalable and comprehensive libraries for performing anomaly detection on multivariate data. More than 35 models are available in PyOD. In general, these models can be grouped into 6 types: probabilistic, linear models, proximity-based, outlier ensembles, neural networks, and graph-based methods.

We will compare 4 of these 6 groups by taking one semi-supervised model from each group and applying it to porosity, permeability, and density data from the Volve dataset (graph-based and neural network methods are not included due to their longer running time).

Practical Application

We divide the dataframe into 2 parts: df_clean (a dataframe with no outliers, used as training data) and df_mix (a dataframe with outliers, used as testing data).

# import standard libraries
import pandas as pd
import plotly.express as px

# import porosity, permeability, density data
df = pd.read_csv('volve_pordenperm.csv')
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

# split the data: df_clean for training, df_mix for testing
df_clean = df.loc[:80]         # rows 0-80: no outliers
df_mix = df.loc[81:].copy()    # remaining rows contain outliers; .copy() avoids SettingWithCopyWarning later

We compare the crossplots of porosity, permeability, and density for df_clean and df_mix using plotly.express.

# df_clean crossplot

fig = px.scatter(df_clean, x="porosity", y="permeability", color="density")
fig.show()

fig = px.scatter(df_clean, x="porosity", y="density", color="permeability")
fig.show()

fig = px.scatter(df_clean, x="density", y="permeability", color="porosity")
fig.show()

# df_mix crossplot

fig = px.scatter(df_mix, x="porosity", y="permeability", color="density")
fig.show()

fig = px.scatter(df_mix, x="porosity", y="density", color="permeability")
fig.show()

fig = px.scatter(df_mix, x="density", y="permeability", color="porosity")
fig.show()
Figure 1. Porosity, Permeability, & Density Crossplot in both dataframes.

Figure 1 shows that the two dataframes have a similar data distribution, with no values lying too far from the center of the data. We will now analyze how the outlier classification of df_mix differs across 4 semi-supervised machine learning methods.

from pyod.models.knn import KNN # Proximity-Based
from pyod.models.kde import KDE # Probabilistic
from pyod.models.pca import PCA # Linear Model
from pyod.models.iforest import IForest # Outlier Ensembles

X_train = df_clean[['porosity','permeability','density']]
X_test = df_mix[['porosity','permeability','density']]

kNN (k-Nearest Neighbor)

kNN identifies outliers based on the distances from a point to its n nearest neighbors and turns those distances into an outlier score. The score can be computed as the largest, the mean, or the median of the neighbor distances; the default method is 'largest'.
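These options correspond to the n_neighbors and method arguments of PyOD's KNN detector. A minimal sketch with non-default settings (the values are illustrative only):

# illustrative non-default settings for the kNN detector
clf = KNN(n_neighbors=10, method='mean')  # score = mean distance to the 10 nearest neighbors
clf.fit(X_train)

In the example below, we keep the defaults.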

clf = KNN()                       # defaults: n_neighbors=5, method='largest'
clf.fit(X_train)                  # fit on the clean training data only
y_pred_knn = clf.predict(X_test)  # 0 = inlier, 1 = outlier

df_mix['KNN'] = y_pred_knn
df_mix["KNN"] = df_mix["KNN"].astype(str)  # string labels give a discrete color scale in plotly

fig = px.scatter(df_mix, x="porosity", y="permeability", color="KNN",
hover_data=['density']
)
fig.show()

fig = px.scatter(df_mix, x="porosity", y="density", color="KNN",
hover_data=['permeability']
)
fig.show()

fig = px.scatter(df_mix, x="density", y="permeability", color="KNN",
hover_data=['porosity']
)
fig.show()
Figure 2. Outlier identification result of KNN method.

Figure 2 shows that the KNN outlier identification tends to be biased toward the permeability data: the 1 (outlier) and 0 (non-outlier) labels still cannot be separated in the density and porosity crossplot. This happens because permeability spans a much larger range than porosity and density (maximum permeability > 250), so the distances, and therefore the kNN outlier scores, are dominated by permeability alone.
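One common remedy, not applied in the workflow above, is to bring the variables onto a comparable scale before fitting the detector. A minimal sketch using scikit-learn's StandardScaler, fitted on the clean training data only:

# standardize the features so no single variable dominates the distance calculation
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)                     # learn mean/std from the clean data
clf = KNN()
clf.fit(scaler.transform(X_train))                         # train on standardized features
y_pred_knn_scaled = clf.predict(scaler.transform(X_test))  # predict on standardized test data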

KDE (Kernel Density Estimation)

Moving on to the second method, KDE is an outlier detection method that works based on a density estimate at each point. Outliers are detected by comparing the local density of each point to the local density of its neighbors. The bandwidth is an important hyperparameter in KDE: it controls how much weight is given to distance in the density estimate. The larger the bandwidth, the more influence neighbors that are further away have.

clf = KDE()
clf.fit(X_train)
y_pred_kde = clf.predict(X_test)

df_mix['KDE'] = y_pred_kde
df_mix["KDE"] = df_mix["KDE"].astype(str)

fig = px.scatter(df_mix, x="porosity", y="permeability", color="KDE",
hover_data=['density']
)
fig.show()

fig = px.scatter(df_mix, x="porosity", y="density", color="KDE",
hover_data=['permeability']
)
fig.show()

fig = px.scatter(df_mix, x="density", y="permeability", color="KDE",
hover_data=['porosity']
)
fig.show()
Figure 3. Outlier identification result of KDE method.

Figure 3 shows that the KDE method overestimates the outliers: the red dots (outliers) far outnumber the blue dots (non-outliers). This overestimation can be caused by the bandwidth hyperparameter. Figure 4 shows the result if we choose a larger bandwidth.
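A minimal sketch of re-fitting the detector with a larger bandwidth; the value below is illustrative, and the exact bandwidth behind Figure 4 is not specified:

# re-fit KDE with a larger bandwidth (value is illustrative)
clf = KDE(bandwidth=5)
clf.fit(X_train)
y_pred_kde_wide = clf.predict(X_test)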

Figure 4. Outlier identification result of KDE method with higher bandwidth.

PCA (Principal Component Analysis)

PCA is a linear dimensionality reduction method that decomposes the data into eigenvectors and eigenvalues. The eigenvectors with large eigenvalues capture most of the variance of the data. Observations that deviate strongly along the eigenvectors with small eigenvalues can be considered outliers, since they do not fit the hyperplane constructed from the normal data points.

clf = PCA()
clf.fit(X_train)
y_pred_pca = clf.predict(X_test)

df_mix['PCA'] = y_pred_pca
df_mix["PCA"] = df_mix["PCA"].astype(str)

fig = px.scatter(df_mix, x="porosity", y="permeability", color="PCA",
hover_data=['density']
)
fig.show()

fig = px.scatter(df_mix, x="porosity", y="density", color="PCA",
hover_data=['permeability']
)
fig.show()

fig = px.scatter(df_mix, x="density", y="permeability", color="PCA",
hover_data=['porosity']
)
fig.show()
Figure 5. Outlier identification result of PCA method.

Unlike the results of KNN and KDE, PCA tends to produce a clearer outlier separation across all three variables (porosity, permeability, and density). The eigenvectors weight all variables in the transformed linear space, outliers are determined there, and the data is then transformed back to the original domain. In addition, the PCA model in PyOD standardizes the input by default and scores outliers by their deviation along the low-variance eigenvectors computed from all variables, which reduces the bias caused by very large distances in one particular variable.
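Assuming the current PyOD interface, the standardization behavior is controlled by the standardization argument of the PCA detector (it defaults to True), and the variance captured by each component can be inspected after fitting. A minimal sketch:

# PCA in PyOD standardizes the input by default (standardization=True)
clf = PCA(standardization=True)
clf.fit(X_train)
print(clf.explained_variance_ratio_)  # fraction of variance captured by each component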

Isolation Forest

Isolation Forest is a tree-based method for determining outliers. The algorithm consists of two steps:

  1. Build isolation trees using a recursive partitioning algorithm on sub-samples of the training set until each point is isolated
  2. Compute an anomaly score for new points

Outliers tend to have a much shorter average path length in the isolation trees (and hence a higher anomaly score) than normal data.
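The ensemble size and sub-sample size are exposed through the n_estimators and max_samples arguments of PyOD's IForest, and the path-length-based scores can be inspected directly. A minimal sketch with illustrative values; below, we again keep the defaults as with the other detectors:

# illustrative non-default settings; decision_function returns the raw anomaly scores
clf = IForest(n_estimators=200, max_samples=64, random_state=42)
clf.fit(X_train)
scores_iforest = clf.decision_function(X_test)  # higher score means shorter path length, i.e. more anomalous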

clf = IForest()
clf.fit(X_train)
y_pred_iforest = clf.predict(X_test)

df_mix['IForest'] = y_pred_iforest
df_mix["IForest"] = df_mix["IForest"].astype(str)

fig = px.scatter(df_mix, x="porosity", y="permeability", color="IForest",
hover_data=['density']
)
fig.show()

fig = px.scatter(df_mix, x="porosity", y="density", color="IForest",
hover_data=['permeability']
)
fig.show()

fig = px.scatter(df_mix, x="density", y="permeability", color="IForest",
hover_data=['porosity']
)
fig.show()
Figure 6. Outlier identification result of IForest method.

Similar to PCA, Isolation Forest produces a clear separation on the density data (Figure 6). Like other tree-based machine learning methods, Isolation Forest depends less on raw variable distances: outliers are determined by the partitioning logic and tree depth, which are far less influenced by a very large spread in one particular variable.
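A quick way to compare the four detectors is to count how many points each one flags as an outlier. A minimal sketch using the label columns added above:

# count flagged outliers per method (labels were stored as strings, so cast back to int)
outlier_counts = df_mix[['KNN', 'KDE', 'PCA', 'IForest']].astype(int).sum()
print(outlier_counts)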

Final Words

Even though the PCA and Isolation Forest results tend to be similar, each method clearly has its own way of identifying and separating outliers. We cannot say that one method is better or worse than another, since each works differently. What we need to do is understand how each of these methods works and match it to the problem at hand.

References

Angiulli, F., & Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. Lecture Notes in Computer Science, 15-27. doi:10.1007/3-540-45681-3_2

Latecki, L. J., Lazarevic, A., & Pokrajac, D. (n.d.). Outlier Detection with Kernel Density Functions. Lecture Notes in Computer Science, 61-75. doi:10.1007/978-3-540-73499-4_6

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. 2008 Eighth IEEE International Conference on Data Mining. doi:10.1109/icdm.2008.17

Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., & Chang, L. (n.d.). Principal Component-based Anomaly Detection Scheme. Studies in Computational Intelligence, 311-329. doi:10.1007/11539827_18
