Naive Bayes for Data Science — With Python

--

The naive Bayes method is one of the most efficient algorithms for classification.

Many business problems require automating decisions. For example, what is the churn likelihood for a given customer? What is the likelihood that a given customer clicks on an ad? And so on. These are categorised as classification problems, which are themselves part of a larger topic called supervised learning. Most classification problems have an outcome that takes only two different values; these are known as binary classification problems. Some examples of binary outcomes are phishing/not-phishing, click/don't click, and churn/don't churn. Even in the case of more than two outcomes, the problem can often be recast into a series of binary problems using conditional probabilities.

Classification

There are many solutions proposed for classification, and most of them share one common approach: calculate the probability that a given sample belongs to a specific class. Deciding whether that probability indicates class membership is then a more subjective step, driven by a cut-off threshold. This threshold is mainly determined by the utility function or by risk-aversion policies. For example, in one application, classifying a sample as class 1 at a given probability could be more acceptable than in another application at the same probability. So, in many classification problems, the final call on the outcome is a subjective decision.
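As a small illustration of applying such a cut-off (the probability and threshold values here are purely made up):

# hypothetical numbers: the model's estimated probability and a chosen cut-off
p_churn = 0.72        # estimated probability that this customer churns
threshold = 0.5       # cut-off derived from the utility function / risk policy
predicted_class = 'churn' if p_churn >= threshold else 'no churn'
print(predicted_class)  # 'churn'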

Among these solutions, naive Bayes is very efficient, especially when dealing with large data sets. The core of naive Bayes is the famous Bayes rule:

Bayes rule: P(A | B) = P(B | A) · P(A) / P(B)

Using the Bayes rule, you can attack the problem in reverse, since in many cases the information is more readily available in the reverse direction. By reverse, we mean swapping the roles of the output and the input. For example, if we want to calculate the probability of A given B, we can tackle this by looking at a scaled version of the probability of B given A. Usually, the data set gives us the probability of B given A, while what we are looking for is the probability of A given B.
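As a quick illustration, here is a small worked example with made-up numbers showing how the reversed probability is obtained:

# Hypothetical example: A = "customer churns", B = "customer filed a complaint".
p_b_given_a = 0.60   # P(complaint | churn), easy to estimate from churned customers
p_a = 0.10           # P(churn), the base churn rate
p_b = 0.15           # P(complaint), the overall complaint rate

# Bayes rule: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.40 -> a complaining customer has a 40% churn probability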

Bayes rule tells us to look at the data and see what the frequency of a given class is among the records with the same features. After deriving the class likelihood for a given set of features, we decide which class the record belongs to. This is the most obvious approach that comes to mind when dealing with a classification problem: intuitively, you want all the members of a given class to have similar features. From a mathematical point of view, we have the formula:

Bayes rule for binary classification: P(Y = i | X_1, X_2, …, X_p) = P(X_1, …, X_p | Y = i) · P(Y = i) / P(X_1, …, X_p)

As you can see, you want to calculate the probability of a given class for a given set of features. So, you go through the records with the same features and count the number of times the class is i (you have the output labels). Then you can say how many times, out of all the records with the same features, the class is i.
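To make the counting concrete, here is a small sketch on a made-up toy frame (not the loan data used later) that estimates the class probability by looking only at records with exactly the same feature values:

import pandas as pd

# toy data: two categorical features and a binary outcome (values are made up)
df = pd.DataFrame({
    'home':    ['RENT', 'RENT', 'OWN', 'RENT', 'OWN', 'RENT'],
    'purpose': ['car',  'car',  'car', 'house', 'car', 'car'],
    'outcome': ['default', 'paid off', 'paid off', 'default', 'paid off', 'default'],
})

# exact Bayes estimate: among records with home=RENT and purpose=car,
# how often is the outcome 'default'?
same_features = df[(df.home == 'RENT') & (df.purpose == 'car')]
p_default = (same_features.outcome == 'default').mean()
print(p_default)   # 2 of the 3 matching records defaulted -> 0.666...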

Note that naive Bayes is designed for problems with categorical features. In order to apply it to numerical features, we should first convert them into categorical variables, for example by binning them.
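A minimal sketch of such a conversion using pandas (the column name and bin edges here are hypothetical):

import pandas as pd

# hypothetical numeric feature: annual income in thousands of dollars
income = pd.Series([32, 55, 48, 120, 75, 61])

# convert to a categorical variable with three bins (edges chosen arbitrarily)
income_cat = pd.cut(income, bins=[0, 40, 80, float('inf')],
                    labels=['low', 'medium', 'high'])
print(income_cat)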

However, there is a caveat in using the approach above. In practice, we run into situations where the number of features per record is massive. In these cases, finding records with exactly the same features is a rare event, if not impossible. This is a very common phenomenon called the curse of dimensionality. As a rule of thumb, each additional feature with five possible values divides the probability of finding an exact match by 5; with 10 such features there are already 5^10 (close to 10 million) distinct feature combinations.

So, what is the solution?

The solution is that we are better off freeing ourselves from looking only at the records with exactly the same features, and instead deducing the same probability from the whole data set. If we unbind the features and treat them as independent variables, we can broaden our data scope. This is the intuition behind naive Bayes. In other words, we assume that, for a given class, the features are independent. Of course, this is a strong assumption, which is why the method is called naive Bayes, but it is a very effective approximation that works well in practice. By doing this, we switch to the formula below:

Naive Bayes: P(Y = i | X_1, …, X_p) ∝ P(Y = i) · P(X_1 | Y = i) · P(X_2 | Y = i) · … · P(X_p | Y = i)

This is a simpler formula but with high efficacy.
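To see what the independence assumption buys us, here is a small hand-rolled sketch (again on a made-up toy frame, not the loan data) that estimates each per-feature conditional probability separately and multiplies them:

import pandas as pd

# toy data, values made up
df = pd.DataFrame({
    'home':    ['RENT', 'RENT', 'OWN', 'RENT', 'OWN', 'RENT'],
    'purpose': ['car',  'car',  'car', 'house', 'car', 'car'],
    'outcome': ['default', 'paid off', 'paid off', 'default', 'paid off', 'default'],
})

record = {'home': 'RENT', 'purpose': 'car'}
scores = {}
for cls, group in df.groupby('outcome'):
    # prior P(Y = cls) times the product of per-feature probabilities P(X_j | Y = cls)
    score = len(group) / len(df)
    for feature, value in record.items():
        score *= (group[feature] == value).mean()
    scores[cls] = score

# normalise the scores so they sum to one
total = sum(scores.values())
posterior = {cls: s / total for cls, s in scores.items()}
print(posterior)   # e.g. {'default': 0.667, 'paid off': 0.333}

Note that each conditional probability is now estimated from all records of a class, not just from the few records that match every feature at once.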

Putting Into Practice

To demonstrate naive Bayes in practice, we apply it to the loan data set provided at https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/data (file name loan_data.csv.gz). We would like to predict whether an individual will default or pay off the loan. We limit ourselves to only 3 of the 19 features, as some of the features are numerical. Most of the code is adapted from [1].

import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB

loan_data = pd.read_csv('loan_data.csv.gz')

# convert the outcome and the three predictors to categorical
loan_data.outcome = loan_data.outcome.astype('category')
loan_data.outcome = loan_data.outcome.cat.reorder_categories(['paid off', 'default'])
loan_data.purpose_ = loan_data.purpose_.astype('category')
loan_data.home_ = loan_data.home_.astype('category')
loan_data.emp_len_ = loan_data.emp_len_.astype('category')

predictors = ['purpose_', 'home_', 'emp_len_']
outcome = 'outcome'

# one-hot encode the categorical predictors
X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='')
y = loan_data[outcome]

# fit the naive Bayes model
naive_model = MultinomialNB(alpha=0.01, fit_prior=True)
naive_model.fit(X, y)

# predict the class and the class probabilities for record 146
new_loan = X.loc[146:146, :]
print('predicted class: ', naive_model.predict(new_loan)[0])
probabilities = pd.DataFrame(naive_model.predict_proba(new_loan),
                             columns=naive_model.classes_)
print('predicted probabilities')
print(probabilities)

Running this code prints the predicted class and the class probabilities for record 146.

As can be seen, the probabilities that record number 146 belongs to the default and paid off classes are 65% and 34%, respectively. We can also obtain the per-feature probabilities for each class through the code below:

# convert log-probabilities back to probabilities
feature_prob = np.exp(naive_model.feature_log_prob_)

The output is an array of conditional feature probabilities, with one row per class.

Please note that MultinomialNB stores the logarithm of the feature probabilities (in feature_log_prob_). In order to obtain the actual probabilities, we take the exponential.
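To make these probabilities easier to read, they can be wrapped in a DataFrame with one row per class and one column per dummy feature; a small sketch, reusing the fitted naive_model and the X matrix from the code above:

feature_prob = pd.DataFrame(np.exp(naive_model.feature_log_prob_),
                            index=naive_model.classes_,
                            columns=X.columns)
print(feature_prob)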

MultinomialNB applies additive smoothing controlled by the alpha parameter. With fit_prior=True it also learns a prior distribution over the classes, just like Bayesian inference, where the prior is the most distinctive element compared to its counterpart, frequentist inference.
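For example, the learned class priors are stored (again as logarithms) in class_log_prior_, and with fit_prior=True they should match the class frequencies in the training data; a quick check, reusing the fitted naive_model from above:

# learned priors P(Y = i); with fit_prior=True these follow the class frequencies
class_prior = pd.Series(np.exp(naive_model.class_log_prior_),
                        index=naive_model.classes_)
print(class_prior)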

Wrap-up

In this short introduction, we briefly presented naive Bayes and the concepts behind it. We also implemented a classification example using the naive Bayes method.

Reference

[1] Bruce, Peter, Andrew Bruce, and Peter Gedeck. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly Media, 2020.
