Problem Statement
The objective of this project is to develop a credit card fraud detection system using the dataset from Kaggle. The dataset contains a large number of credit card transactions, with both fraudulent and non-fraudulent cases. The goal is to build a machine learning model that can accurately identify fraudulent transactions to help financial institutions mitigate the risk and minimize losses due to fraudulent activities.
About Dataset
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
Credit Card Fraud Detection
Background
“Continuing the trend of prior years, the cost of fraud continues to rise for global financial institutions…Fraudsters continuously test for the weakest entry point in the financial transaction system and these institutions should apply a multi-layered approach to fraud prevention to combat this growing issue.” -LexisNexis Risk Solutions Sr. Director of Fraud and Identity Management Strategy Kimberly Sutherland
The Nilson Report puts card fraud (credit, debit, etc.) at 31.26 billion dollars in 2018, expected to increase to 32.82 billion dollars in 2019.
For perspective, PayPal's and Mastercard's 2018 revenues were only 15.45 and 14.95 billion dollars, respectively.
Detecting fraud is typically challenging because of these four characteristics of fraud:
- Uncommon: Fraud cases are a small minority; sometimes only 0.01% of a company's transactions are fraudulent. With so few cases of fraud, there's little data from which to learn how to identify them. This is known as class imbalance, and it's one of the main challenges of fraud detection.
- Concealed: Fraudsters will also try their best to blend in and conceal their activities.
- Change over time: Fraudsters will find new methods to avoid getting caught and change their behaviors over time.
- Organized: Fraudsters oftentimes work together and organize their activities in a network, making it harder to detect.
How does a company deal with fraud?
- Use rules-based systems, based on manually set thresholds and experience, to filter out strange cases.
- Check the news: the fraud analytics team checks the news for suspicious names.
- Receive external lists of fraudulent accounts and names: keep track of external hit lists from the police and cross-check them against the client base.
- Use machine learning algorithms to detect suspicious names and behaviors.
- Combine different strategies and various models together to avoid sub-par detection results since organized crime schemes are so sophisticated and quick to adapt.
Machine Learning in Fraud Detection
Traditional rules-based expert systems are not enough to catch fraud. They do an excellent job of uncovering known patterns, but alone they aren't very effective at uncovering unknown schemes, adapting to new fraud patterns, or handling fraudsters' increasingly sophisticated techniques. This is where machine learning becomes necessary for fraud detection.
Many in the financial services industry have updated their fraud detection to include some basic machine learning algorithms including various clustering classifiers, linear approaches, and support vector machines. The most advanced companies in the financial services industry, such as PayPal, have been pioneering more advanced artificial intelligence techniques such as deep neural networks and autoencoders. When building a machine-learning model suite for fraud detection, it is very important not only to identify bad activity(high true positive rate) but also to allow good transactions to go through(low false positive rate).
Supervised and Unsupervised Machine Learning
- Supervised Machine Learning: a model trained on a set of properly "labeled" transactions, each tagged as either fraud or non-fraud. Supervised model accuracy is directly correlated with the amount of clean, relevant training data. Common supervised machine learning methods include linear regression and logistic regression.
- Unsupervised Machine Learning: a model trained in cases where tagged transaction data is thin or non-existent. Unsupervised models are designed to discover outliers that represent previously unseen forms of fraud. In real-world fraud detection, well-labeled data is rare, so supervised machine learning methods alone cannot do the whole job and unsupervised learning plays an important role.
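The contrast can be illustrated on synthetic data (a minimal sketch, not part of this notebook's pipeline, with all data made up): a supervised logistic regression learns from labels, while an unsupervised outlier detector such as scikit-learn's IsolationForest flags anomalies without ever seeing them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 990 "normal" transactions near the origin, 10 "fraud" outliers far away
X = np.vstack([rng.normal(0, 1, (990, 2)), rng.normal(6, 1, (10, 2))])
y = np.array([0] * 990 + [1] * 10)

# Supervised: trained directly on the labels y
supervised = LogisticRegression(max_iter=1000).fit(X, y)
supervised_recall = supervised.predict(X)[y == 1].mean()

# Unsupervised: never sees y, just flags the rarest ~1% as outliers
unsupervised = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = (unsupervised.predict(X) == -1).astype(int)  # -1 means outlier
unsupervised_recall = flags[y == 1].mean()

print("supervised recall:", supervised_recall)
print("unsupervised recall:", unsupervised_recall)
```

On well-separated toy data both do well; the point is that the unsupervised detector needs no labels at all, which matters when labeled fraud is scarce.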
Model Evaluation in Credit Card Fraud Detection
- Accuracy isn’t everything. When working with highly imbalanced data, accuracy is not a reliable performance metric. Because by doing nothing but just predicting everything is in the maority class, you can obtain a higher accuracy than by building a predictive model.
- Precision: true positives / (true positives + false positives)
- Recall : true positives / (true positives +false negatives)
- F1-score: 2 x Precision x Recall / (Precision + Recall) = 2 x TP / (2 x TP + FP + FN)
- A credit card company wants to catch as much fraud as possible (reduce false negatives), since fraudulent transactions can be very costly, while a false alarm means someone's legitimate transaction is blocked (so false positives matter too). The company therefore wants to optimize recall, and the F1-score takes into account the balance between precision and recall.
- Precision-Recall Curve (PR): precision vs. recall at various threshold settings.
- Average Precision (AP): summarizes the PR curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight: AP = Σn (Rn − Rn−1) Pn.
- Receiver Operating Characteristic Curve (ROC): true positive rate vs. false positive rate at various threshold settings. It is useful for comparing the performance of different fraud-detection algorithms.
- Area Under the Receiver Operating Characteristic Curve (AUROC): answers the question "How well can this classifier be expected to perform in general, at a variety of different baseline probabilities?", which precision and recall alone do not.
- Confusion Matrix: shows the counts of true/false positives and negatives, i.e. how many fraud cases are predicted correctly.
- Classification Report: tells you about the precision and recall of your model.
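All of the metrics listed above are available in scikit-learn. A tiny worked example on made-up labels (ten transactions, one false positive, one false negative):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report)

# Hypothetical labels: 3 of 10 transactions are actual fraud (class 1)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # one false positive, one false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2/3
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)            # 2*TP / (2*TP + FP + FN) = 2/3
print(tn, fp, fn, tp)                    # 6 1 1 2
print(prec, rec, f1)
print(classification_report(y_true, y_pred))
```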
Map of This Project
This project will use several machine learning algorithms, from logistic regression to more advanced techniques such as autoencoders.
Module 1: Data Exploration
Module 2: Resampling for Imbalanced Data
Module 3: Logistic Regression
Module 4: Decision Tree Classifier
Module 5: Random Forest Classifier
Module 6: Voting Classifier
Module 7: K-means Clustering
Module 8: BDS
Module 9: Autoencoder Neural Networks
Dataset Context
- The data contains 284,807 European credit card transactions with 492 fraudulent transactions that occurred over two days in September 2013.
- Everything except 'Time' and 'Amount' has been reduced by Principal Component Analysis (PCA) for privacy reasons: features V1, V2, … V28 are the principal components obtained with PCA, and only 'Time' and 'Amount' have not been transformed.
- To apply a PCA transformation, features must be scaled beforehand, so features V1, V2, … V28 have already been scaled.
- Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
- Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.
- The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Given the class imbalance ratio, the Area Under the Precision-Recall Curve (AUPRC) is recommended for measuring performance. Plain accuracy is not meaningful for unbalanced classification because of the accuracy paradox.
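The accuracy paradox is easy to demonstrate with labels shaped like this dataset: a "classifier" that predicts non-fraud for every transaction scores 99.83% accuracy while catching zero fraud, and its average precision collapses to the fraud rate. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

# Mimic the dataset's imbalance: 492 frauds among 284,807 transactions
y_true = np.zeros(284807, dtype=int)
y_true[:492] = 1

# "Do nothing" classifier: predict non-fraud everywhere
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# An uninformative constant score yields AP equal to the positive rate
ap = average_precision_score(y_true, np.zeros(len(y_true)))
print(acc)   # ~0.9983 -- looks excellent
print(rec)   # 0.0 -- catches no fraud at all
print(ap)    # ~0.0017 -- AUPRC exposes the useless model
```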
Module 1: Data Exploration
Import modules, methods and our dataset
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
# Read csv
df = pd.read_csv("../input/creditcard.csv")
Check and Visualize the Fraud to Non-fraud Ratio
# Explore the features available in our dataframe
df.shape
df.info()
df.head()
df.describe()
print(df.Amount.describe())
# Count the occurrences of fraud and no fraud cases
fnf = df["Class"].value_counts()
# Print the ratio of fraud cases
print(fnf/len(df))
# Plotting your data
plt.xlabel("Class")
plt.ylabel("Number of Observations")
fnf.plot(kind = 'bar',title = 'Frequency by observation number',rot=0)
# Plot how fraud and non-fraud cases are scattered
plt.scatter(df.loc[df['Class'] == 0]['V1'], df.loc[df['Class'] == 0]['V2'], label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(df.loc[df['Class'] == 1]['V1'], df.loc[df['Class'] == 1]['V2'], label="Class #1", alpha=0.5, linewidth=0.15,c='r')
plt.show()
Dataset Summary:
- We have 284,807 entries with 30 features and 1 target (Class).
- There are no “Null” values, so no need to work on ways to replace values.
- The mean of all the amounts is relatively small, approximately USD 88.
- Most of the transactions were non-fraud (99.83% of the time), while fraud transactions occur 0.17% of the time in the dataframe.
Distribution of 2 Features : Time and Amount
import seaborn as sns
fig, ax = plt.subplots(1, 2, figsize=(18,4))
# Plot the distribution of 'Time' feature
sns.histplot(df['Time'].values/(60*60), kde=True, stat='density', ax=ax[0], color='r')  # distplot is deprecated in newer seaborn; histplot replaces it
ax[0].set_title('Distribution of Transaction Time', fontsize=14)
ax[0].set_xlim([min(df['Time'].values/(60*60)), max(df['Time'].values/(60*60))])
sns.histplot(df['Amount'].values, kde=True, stat='density', ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Amount', fontsize=14)
ax[1].set_xlim([min(df['Amount'].values), max(df['Amount'].values)])
plt.show()
Summary:
- Time: Most transactions happened during the daytime.
- The mean transaction amount is 88 USD and the 75% quantile is 77 USD.
- These two skewed features should also be scaled.
Split the Dataset into Two Datasets and Summarize
# Separate the data into non-fraud and fraud cases
df_nonfraud = df[df.Class == 0] #save non-fraud df observations into a separate df
df_fraud = df[df.Class == 1] #do the same for frauds
Compare the Amount of transactions in two separate datasets
See if we can flag fraud cases by transaction amount
# Summarize statistics and see differences between fraud and normal transactions
print(df_nonfraud.Amount.describe())
print('_'*25)
print(df_fraud.Amount.describe())
# Import the module
from scipy import stats
F, p = stats.f_oneway(df['Amount'][df['Class'] == 0], df['Amount'][df['Class'] == 1])
print("F:", F)
print("p:",p)
Summary:
The mean transaction amount among fraud cases is 122 USD, versus 88 USD among non-fraud cases, and the difference is statistically significant.
Transaction Amount Visualization
We expect a lot of low-value transactions to be uninteresting (buying cups of coffee, lunches, etc.).
So we only visualize the transactions between USD 200 and 2000.
# Plot of high value transactions($200-$2000)
bins = np.linspace(200, 2000, 100)
plt.hist(df_nonfraud.Amount, bins, alpha=1, density=True, label='Non-Fraud')  # normed was removed in newer matplotlib; use density
plt.hist(df_fraud.Amount, bins, alpha=1, density=True, label='Fraud')
plt.legend(loc='upper right')
plt.title("Amount by percentage of transactions (transactions \$200-\$2000)")
plt.xlabel("Transaction amount (USD)")
plt.ylabel("Percentage of transactions (%)")
plt.show()
Summary:
- In the long tail, fraud transactions happen more frequently.
- It seems it would be hard to differentiate fraud from normal transactions by transaction amount alone.
Transaction Hour
Let’s look at the transaction percentage from day 0 to the next day.
# Plot of transactions in 48 hours
bins = np.linspace(0, 48, 48) #48 hours
plt.hist((df_nonfraud.Time/(60*60)), bins, alpha=1, density=True, label='Non-Fraud')
plt.hist((df_fraud.Time/(60*60)), bins, alpha=0.6, density=True, label='Fraud')
plt.legend(loc='upper right')
plt.title("Percentage of transactions by hour")
plt.xlabel("Transaction time from first transaction in the dataset (hours)")
plt.ylabel("Percentage of transactions (%)")
plt.show()
Hour “zero” corresponds to the hour the first transaction happened and not necessarily 12-1am.
Given the heavy decrease in normal transactions from hours 1 to 8 and again roughly at hours 24 to 32, it seems fraud tends to occur at higher rates during the night.
Statistical tests could be used to give evidence for this fact.
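One such test (an illustrative sketch, not from the notebook) is a two-sample Kolmogorov-Smirnov test comparing the hour distributions of fraud vs. non-fraud transactions; with the real dataframes this would be `stats.ks_2samp(df_fraud.Time/3600, df_nonfraud.Time/3600)`. The data below is synthetic, purely to show the mechanics:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in hours (hypothetical): non-fraud centered in daytime,
# fraud centered at night, mimicking the pattern seen in the plot
rng = np.random.RandomState(0)
hours_nonfraud = rng.normal(14, 4, 5000) % 24
hours_fraud = rng.normal(3, 4, 200) % 24

# H0: both samples come from the same distribution
stat, p = stats.ks_2samp(hours_fraud, hours_nonfraud)
print("KS statistic: %.3f, p-value: %.3g" % (stat, p))
# A tiny p-value rejects H0, supporting "fraud timing differs"
```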
Transaction Amount vs. Hour
# Plot of transactions in 48 hours
plt.scatter((df_nonfraud.Time/(60*60)), df_nonfraud.Amount, alpha=0.6, label='Non-Fraud')
plt.scatter((df_fraud.Time/(60*60)), df_fraud.Amount, alpha=0.9, label='Fraud')
plt.title("Amount of transaction by hour")
plt.xlabel("Transaction time as measured from first transaction in the dataset (hours)")
plt.ylabel('Amount (USD)')
plt.legend(loc='upper right')
plt.show()
Amount and hour alone are not enough to make a good classifier.
For example, it would be hard to draw a line that cleanly separates fraud and non-fraud transactions.
Feature Scaling
As noted before, features V1-V28 have already been transformed by PCA and scaled, whereas 'Time' and 'Amount' have not. Since we will analyze these two features alongside V1-V28, they should be scaled before we train our models with various algorithms. Here is why and how.
Which scaling method should we use?
The Standard Scaler is not recommended because the 'Time' and 'Amount' features are not normally distributed.
The Min-Max Scaler is also not recommended because there are noticeable outliers in the 'Amount' feature.
The Robust Scaler is robust to outliers: (xi − median(x)) / (Q3(x) − Q1(x)), where Q1 and Q3 are the 25% and 75% quartiles.
So we choose the Robust Scaler to scale these two features.
# Scale "Time" and "Amount"
from sklearn.preprocessing import StandardScaler, RobustScaler
df['scaled_amount'] = RobustScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = RobustScaler().fit_transform(df['Time'].values.reshape(-1,1))
# Make a new dataset named "df_scaled" dropping out original "Time" and "Amount"
df_scaled = df.drop(['Time','Amount'],axis = 1,inplace=False)
df_scaled.head()
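As a quick sanity check (a sketch on made-up amounts, not notebook output), RobustScaler's result matches the manual (x − median) / IQR formula even with an extreme outlier present:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Skewed toy amounts with one extreme outlier, mimicking 'Amount'
amounts = np.array([1.0, 2.0, 5.0, 9.0, 12.0, 20.0, 5000.0]).reshape(-1, 1)
scaled = RobustScaler().fit_transform(amounts)

# Manual check: subtract the median, divide by the interquartile range
q1, med, q3 = np.percentile(amounts, [25, 50, 75])
manual = (amounts - med) / (q3 - q1)
print(np.allclose(scaled, manual))  # True
```

Note that the 5000.0 outlier barely affects the median and IQR, which is exactly why this scaler suits the 'Amount' feature.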
Correlation Matrices
Correlation matrices are essential for understanding our data. We want to know whether any features heavily influence whether a specific transaction is a fraud.
# Calculate Pearson correlation coefficients
corr = df_scaled.corr()
# Plot heatmap of correlation
f, ax = plt.subplots(1, 1, figsize=(24,20))
sns.heatmap(corr, cmap='coolwarm_r')  # annot_kws has no effect without annot=True, so it is dropped
ax.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=24)
Module 2: Resampling for Imbalanced Data
There are two types of resampling methods for imbalanced data: undersampling and oversampling.
- Undersampling: take random draws from the non-fraud observations to match the number of fraud observations. But you're randomly throwing away a lot of data and information. aka: Random Under Sampling
- Oversampling: take random draws from the fraud cases and copy these observations to increase the number of fraud samples in your data. But you are training your model on many duplicates. aka: Random Over Sampling & SMOTE
- Synthetic Minority Oversampling Technique (SMOTE): adjusts the class imbalance by oversampling the minority observations (fraud cases), using nearest neighbors of fraud cases to create new synthetic fraud cases instead of just copying the minority samples.
- A common mistake when resampling is testing your model on the oversampled or undersampled dataset. If we want to implement cross-validation, remember to split the data into training and testing sets before resampling, and then resample only the training part.
Another way to avoid this is to use imblearn's Pipeline.
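The synthetic-sample idea behind SMOTE can be sketched in a few lines (a simplified illustration, not imblearn's actual implementation): each new fraud sample is placed on the line segment between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority samples and
    their k nearest minority neighbors to create synthetic samples."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is each point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))        # pick a random minority sample
        j = idx[i, rng.randint(1, k + 1)]  # one of its true neighbors
        lam = rng.rand()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.RandomState(1)
X_min = rng.normal(3, 1, (20, 2))          # toy minority (fraud) samples
X_new = smote_sketch(X_min, n_new=80)
print(X_new.shape)                         # (80, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples always stay inside the minority cloud, unlike plain duplication.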
Extract features from our scaled dataset “df_scaled”
# Define the prep_data function to extract features
def prep_data(df):
    X = df.drop(['Class'], axis=1, inplace=False)
    X = np.array(X).astype(float)  # np.float is removed in newer numpy
    y = df[['Class']]
    y = np.array(y).astype(float)
    return X, y
# Create X and y from the prep_data function
X, y = prep_data(df_scaled)
Resample data with RUS, ROS and SMOTE
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.pipeline import Pipeline # In order to avoid testing the model on resampled data
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the resampling method
undersam = RandomUnderSampler(random_state=0)
oversam = RandomOverSampler(random_state=0)
smote = SMOTE(random_state=0)  # 'regular' SMOTE is the default; the kind argument was removed in newer imblearn
borderlinesmote = BorderlineSMOTE(kind='borderline-2',random_state=0)
# Resample the training data (fit_sample was renamed fit_resample in newer imblearn)
X_undersam, y_undersam = undersam.fit_resample(X_train, y_train)
X_oversam, y_oversam = oversam.fit_resample(X_train, y_train)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
X_borderlinesmote, y_borderlinesmote = borderlinesmote.fit_resample(X_train, y_train)
Module 3: Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a logistic regression model to our data
model = LogisticRegression()
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
Model Evaluation
from sklearn.metrics import roc_curve,roc_auc_score, precision_recall_curve, average_precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create true and false positive rates
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_predicted)
# Calculate Area Under the Receiver Operating Characteristic Curve
probs = model.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, probs[:, 1])
print('ROC AUC Score:',roc_auc)
# Obtain precision and recall
precision, recall, thresholds = precision_recall_curve(y_test, y_predicted)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Define a roc_curve function
def plot_roc_curve(false_positive_rate, true_positive_rate, roc_auc):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=5, label='AUC = %0.3f' % roc_auc)
    plt.plot([0, 1], [0, 1], linewidth=5)
    plt.xlim([-0.01, 1])
    plt.ylim([0, 1.01])
    plt.legend(loc='upper right')
    plt.title('Receiver operating characteristic curve (ROC)')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
# Define a precision_recall_curve function
def plot_pr_curve(recall, precision, average_precision):
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
    plt.show()
# Print the classifcation report and confusion matrix
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
# Plot the roc curve
plot_roc_curve(false_positive_rate,true_positive_rate,roc_auc)
# Plot recall precision curve
plot_pr_curve(recall, precision, average_precision)
Accuracy score= 99.92% which is higher than the baseline 99.83%.
Precision = 91/(12+91) = 0.88: the fraction of predicted positives that are truly fraud.
Recall = 91/(56+91) = 0.62: the fraction of actual fraud cases caught.
F1-score = 0.73.
False positive cases = 12.
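These figures follow directly from the confusion-matrix counts (TP = 91, FP = 12, FN = 56); a quick arithmetic check:

```python
# Verify the quoted metrics from the confusion-matrix counts
tp, fp, fn = 91, 12, 56
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.88 0.62 0.73
```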
Logistic Regression with Resampled Data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Resample your training data
rus = RandomUnderSampler()
ros = RandomOverSampler()
smote = SMOTE(random_state=5)  # 'regular' SMOTE is the default; the kind argument was removed in newer imblearn
blsmote = BorderlineSMOTE(kind='borderline-2', random_state=5)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_blsmote, y_train_blsmote = blsmote.fit_resample(X_train, y_train)
# Fit a logistic regression model to our data
rus_model = LogisticRegression().fit(X_train_rus, y_train_rus)
ros_model = LogisticRegression().fit(X_train_ros, y_train_ros)
smote_model = LogisticRegression().fit(X_train_smote, y_train_smote)
blsmote_model = LogisticRegression().fit(X_train_blsmote, y_train_blsmote)
y_rus = rus_model.predict(X_test)
y_ros = ros_model.predict(X_test)
y_smote = smote_model.predict(X_test)
y_blsmote = blsmote_model.predict(X_test)
print('Classifcation report:\n', classification_report(y_test, y_rus))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_rus))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_ros))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_ros))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_smote))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_smote))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_blsmote))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_blsmote))
print('*'*25)
Logistic Regression with Resampled Data Using a Pipeline
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = LogisticRegression()
# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
As you can see, the BorderlineSMOTE resampling method gives the best F1-score (0.15) compared with the other three resampling methods. Resampling does not necessarily lead to better results in all cases.
When fraud cases are spread and scattered throughout the data, using SMOTE can introduce a bit of bias.
The nearest neighbors of a fraud case aren't necessarily fraud cases themselves, so the synthetic samples may 'confuse' the model slightly.
Module 4: Decision Tree Classifier
# Import the decision tree model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a decision tree model to our data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classifcation report and confusion matrix
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 113/(113+25) = 0.82: the fraction of predicted positives that are truly fraud.
Recall = 113/(113+34) = 0.77: the fraction of actual fraud cases caught.
F1-score = 0.79. False positive cases = 25.
Decision Tree Classifier with SMOTE Data
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = DecisionTreeClassifier()
# Define the pipeline, telling it to combine SMOTE with the Decision Tree model
pipeline = Pipeline([('SMOTE', resampling), ('Decision Tree Classifier', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 0.63: the fraction of predicted positives that are truly fraud.
Recall = 0.71: the fraction of actual fraud cases caught.
F1-score = 0.66.
False positive cases = 62.
Module 5: Random Forest Classifier
# Import the Random Forest Classifier model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a random forest model to our data
model = RandomForestClassifier(random_state=5)
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
# Predict probabilities
probs = model.predict_proba(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classifcation report and confusion matrix
print(accuracy_score(y_test, y_predicted))
print("AUC ROC score: ", roc_auc_score(y_test, probs[:,1]))
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 0.95: the fraction of predicted positives that are truly fraud.
Recall = 0.73: the fraction of actual fraud cases caught.
F1-score = 0.83. False positive cases = 6, which is much better.
Random Forest Classifier with SMOTE Data
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = RandomForestClassifier()
# Define the pipeline, telling it to combine SMOTE with the Random Forest model
pipeline = Pipeline([('SMOTE', resampling), ('Random Forest Classifier', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Predict probabilities
probs = model.predict_proba(X_test)
print(accuracy_score(y_test, y_predicted))
print("AUC ROC score: ", roc_auc_score(y_test, probs[:,1]))
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
Random Forest Classifier Model adjustments
# Import the Random Forest Classifier model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the model with balanced subsample
model = RandomForestClassifier(bootstrap=True,
                               class_weight={0: 1, 1: 12},  # 0: non-fraud, 1: fraud
                               criterion='entropy',
                               max_depth=10,          # change depth of model
                               min_samples_leaf=10,   # change the number of samples in leaf nodes
                               n_estimators=20,       # change the number of trees to use
                               n_jobs=-1,
                               random_state=5)
# Fit your training model to your training set
model.fit(X_train, y_train)
# Obtain the predicted values and probabilities from the model
y_predicted = model.predict(X_test)
# Calculate probs
probs = model.predict_proba(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the roc auc score, the classification report and confusion matrix
print("auc roc score: ", roc_auc_score(y_test, probs[:,1]))
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_test, y_predicted))
The model results don’t improve drastically.
If we mostly care about catching fraud, and not so much about false positives, this option does not actually improve our model, though it is a simple one to try.
By smartly defining more options in the model, you can obtain better predictions.
GridSearchCV to find optimal parameters for Random Forest Classifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define the parameter sets to test
param_grid = {
'n_estimators': [1, 30],
'max_features': ['sqrt', 'log2'],  # 'auto' was removed in newer scikit-learn; 'sqrt' is the equivalent
'max_depth': [4, 8],
'criterion': ['gini', 'entropy']
}
# Define the model to use
model = RandomForestClassifier(random_state=5)
# Combine the parameter sets with the defined model
CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='recall', n_jobs=-1)
# Fit the model to our training data and obtain best parameters
CV_model.fit(X_train, y_train)
CV_model.best_params_
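Note that the fitted `GridSearchCV` object already holds a refit copy of the winning model in `best_estimator_`, so the parameters do not have to be retyped by hand. A self-contained sketch on synthetic data (the grid values here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_s, y_s = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
Xa, Xb, ya, yb = train_test_split(X_s, y_s, test_size=0.3, random_state=0)

grid = {'n_estimators': [10, 30], 'max_depth': [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=5), grid,
                      cv=3, scoring='recall', n_jobs=-1)
search.fit(Xa, ya)

# The refit best estimator is ready to predict -- no manual rebuild needed
best = search.best_estimator_
print(search.best_params_)
print('test accuracy:', best.score(Xb, yb))  # .score() is accuracy for classifiers
```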
Model results using GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve
# Build a RandomForestClassifier using the GridSearchCV parameters
model = RandomForestClassifier(bootstrap=True,
class_weight = {0:1,1:12},
criterion = 'entropy',
n_estimators = 30,
max_features = 'sqrt', # 'auto' is no longer accepted by newer scikit-learn
min_samples_leaf = 10,
max_depth = 8,
n_jobs = -1,
random_state = 5)
# Fit the model to your training data and get the predicted results
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
probs = model.predict_proba(X_test)
# Calculate average precision from the fraud-class probabilities
average_precision = average_precision_score(y_test, probs[:, 1])
# Obtain precision and recall over all probability thresholds
precision, recall, _ = precision_recall_curve(y_test, probs[:, 1])
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the roc_auc_score, classification report and confusion matrix
print('roc_auc_score:', roc_auc_score(y_test, probs[:, 1]))
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion_matrix:\n',confusion_matrix(y_test,y_predicted))
The model built with the GridSearchCV parameters still does not perform better.
Module 6: Voting Classifier
# Import modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the three classifiers to use in the ensemble
clf1 = LogisticRegression(class_weight={0:1,1:15},random_state=5)
clf2 = RandomForestClassifier(class_weight={0:1,1:12},
criterion='entropy',
max_depth=10,
max_features='sqrt', # 'auto' is no longer accepted by newer scikit-learn
min_samples_leaf=10,
n_estimators=20,
n_jobs=-1,
random_state=5)
clf3 = DecisionTreeClassifier(class_weight='balanced',random_state=5)
# Combine the classifiers in the ensemble model
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)], voting='hard')
# Fit the model to your training data and get the predicted results
ensemble_model.fit(X_train,y_train)
y_predicted = ensemble_model.predict(X_test)
# Print the classification report and confusion matrix of the model
# (hard voting has no predict_proba, so no ROC AUC here)
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predicted))
By combining the classifiers, you can take the best of multiple models.
Combining them here does indeed improve performance.
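Hard voting is simply a per-sample majority vote over the individual predictions. A minimal sketch of the mechanics with NumPy (the prediction arrays are made up for illustration, not model output):

```python
import numpy as np

# Predictions from three hypothetical classifiers for five transactions
preds = np.array([[0, 1, 1, 0, 1],   # logistic regression
                  [0, 1, 0, 0, 1],   # random forest
                  [1, 1, 1, 0, 0]])  # decision tree

# Majority vote: a sample is flagged when at least 2 of the 3 models say fraud
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)  # [0 1 1 0 1]
```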
# Adjust weights within the Voting Classifier
# Define the ensemble model
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)],
voting='soft',
weights=[1, 4, 1],
flatten_transform=True)
# Fit the model to your training data and get the predicted results
ensemble_model.fit(X_train,y_train)
y_predicted = ensemble_model.predict(X_test)
probs = ensemble_model.predict_proba(X_test)  # available because voting='soft'
# Calculate average precision from the fraud-class probabilities
average_precision = average_precision_score(y_test, probs[:, 1])
# Obtain precision and recall over all probability thresholds
precision, recall, _ = precision_recall_curve(y_test, probs[:, 1])
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classification report and confusion matrix of the model
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predicted))
ensemble_model.estimators_
The weights option lets you tune the influence of the individual models and search for the best final mix for your fraud detection model.
Here, however, the weighted soft-voting ensemble does not improve performance.
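Under the hood, soft voting with `weights=[1, 4, 1]` takes a weighted average of the per-class probabilities before thresholding. A small sketch with made-up probabilities for a single transaction:

```python
import numpy as np

# Fraud probabilities from three hypothetical models for one transaction
p_lr, p_rf, p_dt = 0.30, 0.70, 0.20
weights = np.array([1, 4, 1])  # the random forest counts four times as much

# Weighted average: (0.30*1 + 0.70*4 + 0.20*1) / 6 = 0.55
p_fraud = np.average([p_lr, p_rf, p_dt], weights=weights)
print(round(p_fraud, 3))  # 0.55

# The ensemble flags fraud because the averaged probability crosses 0.5
predicted = int(p_fraud >= 0.5)
print(predicted)  # 1
```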
Module 7: KMeans Clustering
Prepare unlabeled train and test dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
# Split the data into train set and test set
train,test = train_test_split(df,test_size=0.3,random_state=0)
# Get the arrays of features and labels in train dataset
features_train = train.drop(['Time','Class'],axis=1)
features_train = features_train.values
labels_train = train['Class'].values
# Get the arrays of features and labels in test dataset
features_test = test.drop(['Time','Class'],axis=1)
features_test = features_test.values
labels_test = test['Class'].values
# Normalize the features in both train and test dataset
features_train = normalize(features_train)
features_test = normalize(features_test)
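Note that scikit-learn's `normalize` rescales each row (each transaction) to unit L2 norm, which is different from per-feature standardization. A quick check on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])
X_norm = normalize(X_demo)  # default: L2 norm per row

print(X_norm)                          # [[0.6 0.8], [1. 0.]]
print(np.linalg.norm(X_norm, axis=1))  # every row now has norm 1
```

Per-feature scaling (e.g. StandardScaler) is a common alternative preprocessing step for KMeans, since distances are then not dominated by any one feature.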
Build the model
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
model = KMeans(n_clusters=2,random_state=0)
model.fit(features_train)
labels_train_predicted = model.predict(features_train)
labels_test_predicted = model.predict(features_test)
# Decide if the model's cluster IDs are aligned with the true labels
true_negative,false_positive,false_negative,true_positive = confusion_matrix(labels_train,labels_train_predicted).ravel()
reassignflag = true_negative + true_positive < false_positive + false_negative
print(reassignflag)
# Flip the predicted labels only when the cluster IDs are swapped
if reassignflag:
    labels_test_predicted = 1 - labels_test_predicted
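The flip step exists because KMeans assigns arbitrary cluster IDs: cluster 0 may well end up being the fraud cluster. A minimal sketch of the alignment check on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true = np.array([0, 0, 0, 1, 1])
pred = np.array([1, 1, 1, 0, 0])  # cluster IDs are the mirror image of the labels

tn, fp, fn, tp = confusion_matrix(true, pred).ravel()
if tn + tp < fp + fn:  # more disagreement than agreement: IDs are swapped
    pred = 1 - pred
print(pred)  # [0 0 0 1 1]
```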
Model Evaluation
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score,f1_score
# Calculating confusion matrix for kmeans
print('Confusion Matrix:\n',confusion_matrix(labels_test,labels_test_predicted))
# Scoring kmeans
print('kmeans_precision_score:', precision_score(labels_test,labels_test_predicted))
print('kmeans_recall_score:', recall_score(labels_test,labels_test_predicted))
print('kmeans_accuracy_score:', accuracy_score(labels_test,labels_test_predicted))
print('kmeans_f1_score:',f1_score(labels_test,labels_test_predicted))
We can detect 91 out of 147 fraud cases in the test dataset.
However, there are 17,361 false positives, which indicates that the KMeans model needs better feature selection before it is practical.
Module 8: MiniBatchKMeans Clustering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
# Split the data into train set and test set
train,test = train_test_split(df,test_size=0.3,random_state=0)
# Get the arrays of features and labels in train dataset
features_train = train.drop(['Time','Class'],axis=1)
features_train = features_train.values
labels_train = train['Class'].values
# Get the arrays of features and labels in test dataset
features_test = test.drop(['Time','Class'],axis=1)
features_test = features_test.values
labels_test = test['Class'].values
# Normalize the features in both train and test dataset
features_train = normalize(features_train)
features_test = normalize(features_test)
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import confusion_matrix
model = MiniBatchKMeans(n_clusters=2,random_state=0)
model.fit(features_train)
labels_train_predicted = model.predict(features_train)
labels_test_predicted = model.predict(features_test)
# Decide if the model's cluster IDs are aligned with the true labels
true_negative,false_positive,false_negative,true_positive = confusion_matrix(labels_train,labels_train_predicted).ravel()
reassignflag = true_negative + true_positive < false_positive + false_negative
print(reassignflag)
# Flip the predicted labels only when the cluster IDs are swapped, as with KMeans above
if reassignflag:
    labels_test_predicted = 1 - labels_test_predicted
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score,f1_score
# Calculating confusion matrix for MiniBatchKMeans
print('Confusion Matrix:\n',confusion_matrix(labels_test,labels_test_predicted))
# Scoring MiniBatchKMeans
print('minibatch_kmeans_precision_score:', precision_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_recall_score:', recall_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_accuracy_score:', accuracy_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_f1_score:',f1_score(labels_test,labels_test_predicted))
We can detect 91 out of 147 fraud cases in the test dataset.
However, there are 17,341 false positives, which indicates that the MiniBatchKMeans model also needs better feature selection.
Module 9: Autoencoders
Prepare training data and testing data
First, drop the “Time” column (we will not use it) and apply scikit-learn’s StandardScaler to the “Amount” column. The scaler removes the mean and scales the values to unit variance.
The autoencoder setup differs from the supervised models above: we train the model on normal transactions only, so that fraudulent transactions later stand out through high reconstruction error.
Reserve 30% of the data for testing.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Make another copy of df and drop the unimportant "Time" feature
data = df.drop(['Time'], axis=1)
# Use scikit’s StandardScaler on the "Amount" feature
# The scaler removes the mean and scales the values to unit variance
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Create the training and testing sets
X1_train, X1_test = train_test_split(data, test_size=.3, random_state=0)
X1_train = X1_train[X1_train.Class == 0] # train the model on normal transactions
X1_train = X1_train.drop(['Class'], axis=1)
y1_test = X1_test['Class']
X1_test = X1_test.drop(['Class'], axis=1) #drop the class column
#transform to ndarray
X1_train = X1_train.values
X1_test = X1_test.values
X1_train.shape
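A quick check of what StandardScaler does to a single column (the mean goes to approximately 0 and the standard deviation to approximately 1), on a toy stand-in for the "Amount" column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

amounts = np.array([[1.0], [5.0], [10.0], [100.0]])  # toy "Amount" column
scaled = StandardScaler().fit_transform(amounts)

print(scaled.mean())  # ~0.0
print(scaled.std())   # ~1.0
```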
Build the Autoencoder Model
Use 4 fully connected layers with 14, 7, 7 and 29 neurons respectively for this Autoencoder model.
The first two layers are used for encoder, the last two go for the decoder.
Use L1 regularization during training
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras import regularizers
input_dim = X1_train.shape[1] #num of columns, 29
encoding_dim = 14
hidden_dim = int(encoding_dim / 2)
l1_reg = 1e-5 # L1 activity-regularization strength
input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim,
activation="tanh",
activity_regularizer=regularizers.l1(l1_reg))(input_layer)
encoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(hidden_dim, activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
Train the Autoencoder Model
Train our model for 100 epochs with a batch size of 128 samples and save the best performing model to a file.
The ModelCheckpoint provided by Keras is really handy for such tasks.
Additionally, the training progress will be exported in a format that TensorBoard understands.
nb_epoch = 100
batch_size = 128
autoencoder.compile(metrics=['accuracy'],
loss='mean_squared_error',
optimizer='adam')
checkpointer = ModelCheckpoint(filepath='autoencoder_fraud.h5',
save_best_only=True,
verbose=0)
tensorboard = TensorBoard(log_dir='./logs',
histogram_freq=0,
write_graph=True,
write_images=True)
history = autoencoder.fit(X1_train, X1_train,
epochs=nb_epoch,
batch_size=batch_size,
shuffle=True,
validation_data=(X1_test, X1_test),
verbose=1,
callbacks=[checkpointer, tensorboard]).history
autoencoder = load_model('autoencoder_fraud.h5') # reload the best checkpointed model
Model Evaluation
The model seems to be performing well enough, although there is significant room for improvement by adding more hidden layers.
More hidden layers would allow this network to encode more complex relationships between the input features. The loss of our current model seems to be converging and more training epochs are not likely going to help.
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
Reconstruction Error
Autoencoders are trained to reduce reconstruction error.
predictions = autoencoder.predict(X1_test)
mse = np.mean(np.power(X1_test - predictions, 2), axis=1)
df_error = pd.DataFrame({'reconstruction_error': mse,
'true_class': y1_test})
df_error.describe()
ROC Curve
Since the dataset is highly imbalanced, the Receiver Operating Characteristic (ROC) curve is not very informative here, even though it is the standard output of most binary classifiers:
a model that simply predicts the non-fraud class for every transaction can still produce a decent-looking curve. Let’s have a look at the ROC curve first anyway.
# Import modules
from sklearn.metrics import auc, roc_curve,precision_recall_curve
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import recall_score,f1_score,precision_recall_fscore_support
false_positive_rate, true_positive_rate, thresholds = roc_curve(df_error.true_class, df_error.reconstruction_error)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Plot the roc curve
plot_roc_curve(false_positive_rate,true_positive_rate,roc_auc)
The ROC curve plots the true positive rate versus the false positive rate, over different threshold values.
We basically want the blue line to be as close as possible to the upper left corner.
As our dataset is quite imbalanced, ROC doesn’t look very useful for us even though the results look pretty good.
Recall vs. Precision
Considering the imbalance of our dataset, we take a look at the Recall vs. Precision trade off.
precision, recall, thresholds = precision_recall_curve(df_error.true_class, df_error.reconstruction_error)
# Plot recall precision tradeoff
plt.plot(recall, precision, linewidth=5, label='Precision-Recall curve')
plt.title('Recall vs Precision')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
# Plot precision and recall for different thresholds
plt.plot(thresholds, precision[1:], label="Precision",linewidth=5)
plt.plot(thresholds, recall[1:], label="Recall",linewidth=5)
plt.title('Precision and recall for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.legend()
plt.show()
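One way to turn these curves into a concrete operating point is to pick the lowest threshold whose precision meets a target. A sketch on made-up arrays shaped like the output of `precision_recall_curve` (the 0.8 target is an assumption for illustration, not from the notebook):

```python
import numpy as np

# Hypothetical precision_recall_curve output: precision/recall have
# one more element than thresholds, and precision[i + 1] pairs with thresholds[i]
thresholds = np.array([1.0, 2.0, 3.0, 4.0])
precision = np.array([0.10, 0.40, 0.85, 0.95, 1.00])
recall = np.array([1.00, 0.90, 0.70, 0.40, 0.00])

target_precision = 0.8
# First threshold index whose precision reaches the target
idx = np.argmax(precision[1:] >= target_precision)
print('threshold:', thresholds[idx])  # 2.0 (precision 0.85, recall 0.70)
```

Note that `argmax` returns 0 when no entry meets the target, so a real implementation should check that the target is actually attainable first.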
Prediction
To predict whether a new transaction is normal or fraudulent, we calculate the reconstruction error for the transaction itself.
If the error exceeds a predefined threshold, we mark it as fraud (since the model should reconstruct normal transactions with low error).
# Set a threshold
set_threshold = 5
groups = df_error.groupby('true_class')
fig, ax = plt.subplots()
for name, group in groups:
ax.plot(group.index,
group.reconstruction_error,
marker='o',
ms=3.5,
linestyle='',
label= "Fraud" if name == 1 else "Nonfraud")
ax.hlines(set_threshold,
ax.get_xlim()[0],
ax.get_xlim()[1],
colors="r",
zorder=100,
label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show()
Confusion Matrix
y_pred = [1 if e > set_threshold else 0 for e in df_error.reconstruction_error.values]
print('Confusion matrix:\n',confusion_matrix(df_error.true_class, y_pred))