Problem Statement
The objective of this project is to develop a credit card fraud detection system using the dataset from Kaggle. The dataset contains a large number of credit card transactions, with both fraudulent and non-fraudulent cases. The goal is to build a machine learning model that can accurately identify fraudulent transactions to help financial institutions mitigate the risk and minimize losses due to fraudulent activities.
About Dataset
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
Credit Card Fraud Detection
Background
“Continuing the trend of prior years, the cost of fraud continues to rise for global financial institutions…Fraudsters continuously test for the weakest entry point in the financial transaction system and these institutions should apply a multi-layered approach to fraud prevention to combat this growing issue.” -LexisNexis Risk Solutions Sr. Director of Fraud and Identity Management Strategy Kimberly Sutherland
The Nilson Report puts card fraud (credit, debit, etc.) at 31.26 billion dollars in 2018, expected to increase to 32.82 billion dollars in 2019.
For perspective, PayPal's and Mastercard's 2018 revenues were only 15.45 and 14.95 billion dollars, respectively.
Detecting fraud is typically challenging because of these four characteristics of fraud:
- Uncommon: Fraud cases are a small minority; sometimes only 0.01% of a company's transactions are fraudulent. With so few cases of fraud, there's little data from which to learn how to identify them. This is known as class imbalance, and it's one of the main challenges of fraud detection.
- Concealed: Fraudsters will also try their best to blend in and conceal their activities.
- Change over time: Fraudsters will find new methods to avoid getting caught and change their behaviors over time.
- Organized: Fraudsters oftentimes work together and organize their activities in a network, making it harder to detect.
How does a company deal with fraud?
- Use rules-based systems, based on manually set thresholds and experience, to filter out strange cases.
- Check the news: the fraud analytics team checks the news for suspicious names.
- Receive external lists of fraudulent accounts and names: keep track of external hit lists from the police and cross-check them against the client base.
- Use machine learning algorithms to detect suspicious names and behaviors.
- Combine different strategies and various models together to avoid sub-par detection results since organized crime schemes are so sophisticated and quick to adapt.
Machine Learning in Fraud Detection
Traditional rules-based expert systems are not enough to catch fraud. They do an excellent job of uncovering known patterns, but alone they aren't very effective at uncovering unknown schemes, adapting to new fraud patterns, or handling fraudsters' increasingly sophisticated techniques. This is where machine learning becomes necessary for fraud detection.
Many in the financial services industry have updated their fraud detection to include some basic machine learning algorithms including various clustering classifiers, linear approaches, and support vector machines. The most advanced companies in the financial services industry, such as PayPal, have been pioneering more advanced artificial intelligence techniques such as deep neural networks and autoencoders. When building a machine-learning model suite for fraud detection, it is very important not only to identify bad activity(high true positive rate) but also to allow good transactions to go through(low false positive rate).
Supervised and Unsupervised Machine Learning
- Supervised Machine Learning: a model trained on a set of properly "labeled" transactions, each tagged as either fraud or non-fraud. Supervised model accuracy is directly correlated with the amount of clean, relevant training data. Common supervised machine learning methods include linear regression and logistic regression.
- Unsupervised Machine Learning: a model trained in cases where tagged transaction data is thin or non-existent. Unsupervised models are designed to discover outliers that represent previously unseen forms of fraud. In real-world fraud detection, well-labeled data is rare, so supervised machine learning methods alone cannot do the whole job and unsupervised learning plays an important role.
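The contrast can be illustrated on synthetic data (a minimal sketch, not part of this notebook's pipeline, with all data made up): a supervised logistic regression learns from labels, while an unsupervised outlier detector such as scikit-learn's IsolationForest flags anomalies without ever seeing them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 990 "normal" transactions near the origin, 10 "fraud" outliers far away
X = np.vstack([rng.normal(0, 1, (990, 2)), rng.normal(6, 1, (10, 2))])
y = np.array([0] * 990 + [1] * 10)

# Supervised: trained directly on the labels y
supervised = LogisticRegression(max_iter=1000).fit(X, y)
supervised_recall = supervised.predict(X)[y == 1].mean()

# Unsupervised: never sees y, just flags the rarest ~1% as outliers
unsupervised = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = (unsupervised.predict(X) == -1).astype(int)  # -1 means outlier
unsupervised_recall = flags[y == 1].mean()

print("supervised recall:", supervised_recall)
print("unsupervised recall:", unsupervised_recall)
```

On well-separated toy data both do well; the point is that the unsupervised detector needs no labels at all, which matters when labeled fraud is scarce.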
Model Evaluation in Credit Card Fraud Detection
- Accuracy isn’t everything. When working with highly imbalanced data, accuracy is not a reliable performance metric. Because by doing nothing but just predicting everything is in the maority class, you can obtain a higher accuracy than by building a predictive model.
- Precision: true positives / (true positives + false positives)
- Recall : true positives / (true positives +false negatives)
- F1-score: 2 x Precision x Recall / (Precision + Recall) = 2 x TP / (2 x TP + FP + FN)
- A credit card company wants to catch as much fraud as possible (reduce false negatives), since fraudulent transactions can be very costly, while a false alarm means someone's legitimate transaction is blocked (so false positives matter too). The company therefore wants to optimize recall, and the F1-score takes into account the balance between precision and recall.
- Precision-Recall Curve (PR): precision vs. recall at various threshold settings.
- Average Precision (AP): summarizes the PR curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight: AP = Σn (Rn − Rn−1) Pn.
- Receiver Operating Characteristic Curve (ROC): true positive rate vs. false positive rate at various threshold settings. It is useful for comparing the performance of different fraud-detection algorithms.
- Area Under the Receiver Operating Characteristic Curve (AUROC): answers the question "How well can this classifier be expected to perform in general, at a variety of different baseline probabilities?", which precision and recall alone do not.
- Confusion Matrix: shows the counts of true/false positives and negatives, i.e. how many fraud cases are predicted correctly.
- Classification Report: tells you about the precision and recall of your model.
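All of the metrics listed above are available in scikit-learn. A tiny worked example on made-up labels (ten transactions, one false positive, one false negative):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report)

# Hypothetical labels: 3 of 10 transactions are actual fraud (class 1)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # one false positive, one false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2/3
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)            # 2*TP / (2*TP + FP + FN) = 2/3
print(tn, fp, fn, tp)                    # 6 1 1 2
print(prec, rec, f1)
print(classification_report(y_true, y_pred))
```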
Map of This Project
This project will use several machine learning algorithms, from logistic regression to more advanced techniques such as autoencoders.
Module 1: Data Exploration
Module 2: Resampling for Imbalanced Data
Module 3: Logistic Regression
Module 4: Decision Tree Classifier
Module 5: Random Forest Classifier
Module 6: Voting Classifier
Module 7: K-means Clustering
Module 8: BDS
Module 9: Autoencoder Neural Networks
Dataset Context
- The data contains 284,807 European credit card transactions with 492 fraudulent transactions that occurred over two days in September 2013.
- Everything except 'Time' and 'Amount' has been reduced by Principal Component Analysis (PCA) for privacy reasons: features V1, V2, … V28 are the principal components obtained with PCA, and only 'Time' and 'Amount' have not been transformed.
- To apply a PCA transformation, features must be scaled beforehand, so features V1, V2, … V28 have already been scaled.
- Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
- Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.
- The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Given the class imbalance ratio, the Area Under the Precision-Recall Curve (AUPRC) is recommended for measuring performance. Plain accuracy is not meaningful for unbalanced classification because of the accuracy paradox.
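The accuracy paradox is easy to demonstrate with labels shaped like this dataset: a "classifier" that predicts non-fraud for every transaction scores 99.83% accuracy while catching zero fraud, and its average precision collapses to the fraud rate. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

# Mimic the dataset's imbalance: 492 frauds among 284,807 transactions
y_true = np.zeros(284807, dtype=int)
y_true[:492] = 1

# "Do nothing" classifier: predict non-fraud everywhere
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# An uninformative constant score yields AP equal to the positive rate
ap = average_precision_score(y_true, np.zeros(len(y_true)))
print(acc)   # ~0.9983 -- looks excellent
print(rec)   # 0.0 -- catches no fraud at all
print(ap)    # ~0.0017 -- AUPRC exposes the useless model
```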
Module 1: Data Exploration
Import modules, methods and our dataset
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
# Read csv
df = pd.read_csv("../input/creditcard.csv")
Check and Visualize the Fraud to Non-fraud Ratio
# Explore the features available in our dataframe
df.shape
df.info()
df.head()
df.describe()
print(df.Amount.describe())
# Count the occurrences of fraud and no fraud cases
fnf = df["Class"].value_counts()
# Print the ratio of fraud cases
print(fnf/len(df))
# Plotting your data
plt.xlabel("Class")
plt.ylabel("Number of Observations")
fnf.plot(kind = 'bar',title = 'Frequency by observation number',rot=0)
# Plot how fraud and non-fraud cases are scattered
plt.scatter(df.loc[df['Class'] == 0]['V1'], df.loc[df['Class'] == 0]['V2'], label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(df.loc[df['Class'] == 1]['V1'], df.loc[df['Class'] == 1]['V2'], label="Class #1", alpha=0.5, linewidth=0.15,c='r')
plt.show()
Dataset Summary:
- We have 284,807 entries with 30 features and 1 target (Class).
- There are no “Null” values, so no need to work on ways to replace values.
- The mean of all the amounts is relatively small, approximately USD 88.
- Most of the transactions were non-fraud (99.83% of the time), while fraud transactions occur 0.17% of the time in the dataframe.
Distribution of 2 Features : Time and Amount
import seaborn as sns
fig, ax = plt.subplots(1, 2, figsize=(18,4))
# Plot the distribution of 'Time' feature
sns.histplot(df['Time'].values/(60*60), kde=True, stat='density', ax=ax[0], color='r')  # distplot is deprecated in newer seaborn; histplot replaces it
ax[0].set_title('Distribution of Transaction Time', fontsize=14)
ax[0].set_xlim([min(df['Time'].values/(60*60)), max(df['Time'].values/(60*60))])
sns.histplot(df['Amount'].values, kde=True, stat='density', ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Amount', fontsize=14)
ax[1].set_xlim([min(df['Amount'].values), max(df['Amount'].values)])
plt.show()
Summary:
- Time: Most transactions happened during the daytime.
- The mean transaction amount is 88 USD and the 75% quantile is 77 USD.
- These two skewed features should also be scaled.
Split the Dataset into Two Datasets and Summarize
# Separate the data into non-fraud and fraud cases
df_nonfraud = df[df.Class == 0] #save non-fraud df observations into a separate df
df_fraud = df[df.Class == 1] #do the same for frauds
Compare the Amount of transactions in two separate datasets
See if we can flag fraud cases by transaction amount
# Summarize statistics and see differences between fraud and normal transactions
print(df_nonfraud.Amount.describe())
print('_'*25)
print(df_fraud.Amount.describe())
# Import the module
from scipy import stats
F, p = stats.f_oneway(df['Amount'][df['Class'] == 0], df['Amount'][df['Class'] == 1])
print("F:", F)
print("p:",p)
Summary:
The mean transaction amount among fraud cases is 122 USD, versus 88 USD among non-fraud cases, and the difference is statistically significant.
Transaction Amount Visualization
We expect a lot of low-value transactions to be uninteresting (buying cups of coffee, lunches, etc.).
So we only visualize the transactions between USD 200 and 2000.
# Plot of high value transactions($200-$2000)
bins = np.linspace(200, 2000, 100)
plt.hist(df_nonfraud.Amount, bins, alpha=1, density=True, label='Non-Fraud')  # normed was removed in newer matplotlib; use density
plt.hist(df_fraud.Amount, bins, alpha=1, density=True, label='Fraud')
plt.legend(loc='upper right')
plt.title("Amount by percentage of transactions (transactions \$200-\$2000)")
plt.xlabel("Transaction amount (USD)")
plt.ylabel("Percentage of transactions (%)")
plt.show()
Summary:
- In the long tail, fraud transactions happen more frequently.
- It seems it would be hard to differentiate fraud from normal transactions by transaction amount alone.
Transaction Hour
Let’s look at the transaction percentage from day 0 to the next day.
# Plot of transactions in 48 hours
bins = np.linspace(0, 48, 48) #48 hours
plt.hist((df_nonfraud.Time/(60*60)), bins, alpha=1, density=True, label='Non-Fraud')
plt.hist((df_fraud.Time/(60*60)), bins, alpha=0.6, density=True, label='Fraud')
plt.legend(loc='upper right')
plt.title("Percentage of transactions by hour")
plt.xlabel("Transaction time from first transaction in the dataset (hours)")
plt.ylabel("Percentage of transactions (%)")
plt.show()
Hour “zero” corresponds to the hour the first transaction happened and not necessarily 12-1am.
Given the heavy decrease in normal transactions from hours 1 to 8 and again roughly at hours 24 to 32, it seems fraud tends to occur at higher rates during the night.
Statistical tests could be used to give evidence for this fact.
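One such test (an illustrative sketch, not from the notebook) is a two-sample Kolmogorov-Smirnov test comparing the hour distributions of fraud vs. non-fraud transactions; with the real dataframes this would be `stats.ks_2samp(df_fraud.Time/3600, df_nonfraud.Time/3600)`. The data below is synthetic, purely to show the mechanics:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in hours (hypothetical): non-fraud centered in daytime,
# fraud centered at night, mimicking the pattern seen in the plot
rng = np.random.RandomState(0)
hours_nonfraud = rng.normal(14, 4, 5000) % 24
hours_fraud = rng.normal(3, 4, 200) % 24

# H0: both samples come from the same distribution
stat, p = stats.ks_2samp(hours_fraud, hours_nonfraud)
print("KS statistic: %.3f, p-value: %.3g" % (stat, p))
# A tiny p-value rejects H0, supporting "fraud timing differs"
```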
Transaction Amount vs. Hour
# Plot of transactions in 48 hours
plt.scatter((df_nonfraud.Time/(60*60)), df_nonfraud.Amount, alpha=0.6, label='Non-Fraud')
plt.scatter((df_fraud.Time/(60*60)), df_fraud.Amount, alpha=0.9, label='Fraud')
plt.title("Amount of transaction by hour")
plt.xlabel("Transaction time as measured from first transaction in the dataset (hours)")
plt.ylabel('Amount (USD)')
plt.legend(loc='upper right')
plt.show()
Amount and hour alone are not enough to make a good classifier.
For example, it would be hard to draw a line that cleanly separates fraud and non-fraud transactions.
Feature Scaling
As noted before, features V1-V28 have already been transformed by PCA and scaled, whereas 'Time' and 'Amount' have not. Since we will analyze these two features alongside V1-V28, they should be scaled before we train our models with various algorithms. Here is why and how.
Which scaling method should we use?
The Standard Scaler is not recommended because the 'Time' and 'Amount' features are not normally distributed.
The Min-Max Scaler is also not recommended because there are noticeable outliers in the 'Amount' feature.
The Robust Scaler is robust to outliers: (xi − median(x)) / (Q3(x) − Q1(x)), where Q1 and Q3 are the 25% and 75% quartiles.
So we choose the Robust Scaler to scale these two features.
# Scale "Time" and "Amount"
from sklearn.preprocessing import StandardScaler, RobustScaler
df['scaled_amount'] = RobustScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = RobustScaler().fit_transform(df['Time'].values.reshape(-1,1))
# Make a new dataset named "df_scaled" dropping out original "Time" and "Amount"
df_scaled = df.drop(['Time','Amount'],axis = 1,inplace=False)
df_scaled.head()
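As a quick sanity check (a sketch on made-up amounts, not notebook output), RobustScaler's result matches the manual (x − median) / IQR formula even with an extreme outlier present:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Skewed toy amounts with one extreme outlier, mimicking 'Amount'
amounts = np.array([1.0, 2.0, 5.0, 9.0, 12.0, 20.0, 5000.0]).reshape(-1, 1)
scaled = RobustScaler().fit_transform(amounts)

# Manual check: subtract the median, divide by the interquartile range
q1, med, q3 = np.percentile(amounts, [25, 50, 75])
manual = (amounts - med) / (q3 - q1)
print(np.allclose(scaled, manual))  # True
```

Note that the 5000.0 outlier barely affects the median and IQR, which is exactly why this scaler suits the 'Amount' feature.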
Correlation Matrices
Correlation matrices are essential for understanding our data. We want to know whether any features heavily influence whether a specific transaction is a fraud.
# Calculate Pearson correlation coefficients
corr = df_scaled.corr()
# Plot heatmap of correlation
f, ax = plt.subplots(1, 1, figsize=(24,20))
sns.heatmap(corr, cmap='coolwarm_r')  # annot_kws has no effect without annot=True, so it is dropped
ax.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=24)
Module 2: Resampling for Imbalanced Data
There are two types of resampling methods for imbalanced data: undersampling and oversampling.
- Undersampling: take random draws from the non-fraud observations to match the number of fraud observations. But you're randomly throwing away a lot of data and information. aka: Random Under Sampling
- Oversampling: take random draws from the fraud cases and copy these observations to increase the number of fraud samples in your data. But you are training your model on many duplicates. aka: Random Over Sampling & SMOTE
- Synthetic Minority Oversampling Technique (SMOTE): adjusts the class imbalance by oversampling the minority observations (fraud cases), using nearest neighbors of fraud cases to create new synthetic fraud cases instead of just copying the minority samples.
- A common mistake when resampling is testing your model on the oversampled or undersampled dataset. If we want to implement cross-validation, remember to split the data into training and testing sets before resampling, and then resample only the training part.
Another way to avoid this is to use imblearn's Pipeline.
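The synthetic-sample idea behind SMOTE can be sketched in a few lines (a simplified illustration, not imblearn's actual implementation): each new fraud sample is placed on the line segment between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority samples and
    their k nearest minority neighbors to create synthetic samples."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is each point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))        # pick a random minority sample
        j = idx[i, rng.randint(1, k + 1)]  # one of its true neighbors
        lam = rng.rand()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.RandomState(1)
X_min = rng.normal(3, 1, (20, 2))          # toy minority (fraud) samples
X_new = smote_sketch(X_min, n_new=80)
print(X_new.shape)                         # (80, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples always stay inside the minority cloud, unlike plain duplication.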
Extract features from our scaled dataset “df_scaled”
# Define the prep_data function to extract features
def prep_data(df):
    X = df.drop(['Class'], axis=1, inplace=False)
    X = np.array(X).astype(float)  # np.float is removed in newer numpy
    y = df[['Class']]
    y = np.array(y).astype(float)
    return X, y
# Create X and y from the prep_data function
X, y = prep_data(df_scaled)
Resample data with RUS, ROS and SMOTE
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.pipeline import Pipeline # In order to avoid testing the model on resampled data
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the resampling method
undersam = RandomUnderSampler(random_state=0)
oversam = RandomOverSampler(random_state=0)
smote = SMOTE(random_state=0)  # 'regular' SMOTE is the default; the kind argument was removed in newer imblearn
borderlinesmote = BorderlineSMOTE(kind='borderline-2',random_state=0)
# Resample the training data (fit_sample was renamed fit_resample in newer imblearn)
X_undersam, y_undersam = undersam.fit_resample(X_train, y_train)
X_oversam, y_oversam = oversam.fit_resample(X_train, y_train)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
X_borderlinesmote, y_borderlinesmote = borderlinesmote.fit_resample(X_train, y_train)
Module 3: Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a logistic regression model to our data
model = LogisticRegression()
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
Model Evaluation
from sklearn.metrics import roc_curve,roc_auc_score, precision_recall_curve, average_precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create true and false positive rates
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, y_predicted)
# Calculate Area Under the Receiver Operating Characteristic Curve
probs = model.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, probs[:, 1])
print('ROC AUC Score:',roc_auc)
# Obtain precision and recall
precision, recall, thresholds = precision_recall_curve(y_test, y_predicted)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Define a roc_curve function
def plot_roc_curve(false_positive_rate, true_positive_rate, roc_auc):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=5, label='AUC = %0.3f' % roc_auc)
    plt.plot([0, 1], [0, 1], linewidth=5)
    plt.xlim([-0.01, 1])
    plt.ylim([0, 1.01])
    plt.legend(loc='upper right')
    plt.title('Receiver operating characteristic curve (ROC)')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
# Define a precision_recall_curve function
def plot_pr_curve(recall, precision, average_precision):
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
    plt.show()
# Print the classifcation report and confusion matrix
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
# Plot the roc curve
plot_roc_curve(false_positive_rate,true_positive_rate,roc_auc)
# Plot recall precision curve
plot_pr_curve(recall, precision, average_precision)
Accuracy score= 99.92% which is higher than the baseline 99.83%.
Precision = 91/(12+91) = 0.88: the fraction of predicted positives that are truly fraud.
Recall = 91/(56+91) = 0.62: the fraction of actual fraud cases caught.
F1-score = 0.73.
False positive cases = 12.
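These figures follow directly from the confusion-matrix counts (TP = 91, FP = 12, FN = 56); a quick arithmetic check:

```python
# Verify the quoted metrics from the confusion-matrix counts
tp, fp, fn = 91, 12, 56
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.88 0.62 0.73
```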
Logistic Regression with Resampled Data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Resample your training data
rus = RandomUnderSampler()
ros = RandomOverSampler()
smote = SMOTE(random_state=5)  # 'regular' SMOTE is the default; the kind argument was removed in newer imblearn
blsmote = BorderlineSMOTE(kind='borderline-2', random_state=5)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_blsmote, y_train_blsmote = blsmote.fit_resample(X_train, y_train)
# Fit a logistic regression model to our data
rus_model = LogisticRegression().fit(X_train_rus, y_train_rus)
ros_model = LogisticRegression().fit(X_train_ros, y_train_ros)
smote_model = LogisticRegression().fit(X_train_smote, y_train_smote)
blsmote_model = LogisticRegression().fit(X_train_blsmote, y_train_blsmote)
y_rus = rus_model.predict(X_test)
y_ros = ros_model.predict(X_test)
y_smote = smote_model.predict(X_test)
y_blsmote = blsmote_model.predict(X_test)
print('Classifcation report:\n', classification_report(y_test, y_rus))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_rus))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_ros))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_ros))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_smote))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_smote))
print('*'*25)
print('Classifcation report:\n', classification_report(y_test, y_blsmote))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_blsmote))
print('*'*25)
Logistic Regression with Resampled Data Using a Pipeline
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = LogisticRegression()
# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
As you can see, the BorderlineSMOTE resampling method gives the best F1-score (0.15) compared with the other three resampling methods. Resampling does not necessarily lead to better results in all cases.
When fraud cases are spread and scattered throughout the data, using SMOTE can introduce a bit of bias.
The nearest neighbors of a fraud case aren't necessarily fraud cases themselves, so the synthetic samples may 'confuse' the model slightly.
Module 4: Decision Tree Classifier
# Import the decision tree model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a decision tree model to our data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classifcation report and confusion matrix
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 113/(113+25) = 0.82: the fraction of predicted positives that are truly fraud.
Recall = 113/(113+34) = 0.77: the fraction of actual fraud cases caught.
F1-score = 0.79. False positive cases = 25.
Decision Tree Classifier with SMOTE Data
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = DecisionTreeClassifier()
# Define the pipeline, telling it to combine SMOTE with the Decision Tree model
pipeline = Pipeline([('SMOTE', resampling), ('Decision Tree Classifier', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 0.63: the fraction of predicted positives that are truly fraud.
Recall = 0.71: the fraction of actual fraud cases caught.
F1-score = 0.66.
False positive cases = 62.
Module 5: Random Forest Classifier
# Import the Random Forest Classifier model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a random forest model to our data
model = RandomForestClassifier(random_state=5)
model.fit(X_train, y_train)
# Obtain model predictions
y_predicted = model.predict(X_test)
# Predict probabilities
probs = model.predict_proba(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classifcation report and confusion matrix
print(accuracy_score(y_test, y_predicted))
print("AUC ROC score: ", roc_auc_score(y_test, probs[:,1]))
print('Classification report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))
Precision = 0.95: the fraction of predicted positives that are truly fraud.
Recall = 0.73: the fraction of actual fraud cases caught.
F1-score = 0.83. False positive cases = 6, which is much better.
Random Forest Classifier with SMOTE Data
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
# Define which resampling method and which ML model to use in the pipeline
resampling = BorderlineSMOTE(kind='borderline-2', random_state=0)  # replaces the older SMOTE(kind='borderline2')
model = RandomForestClassifier()
# Define the pipeline, telling it to combine SMOTE with the Random Forest model
pipeline = Pipeline([('SMOTE', resampling), ('Random Forest Classifier', model)])
# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data
pipeline.fit(X_train, y_train)
y_predicted = pipeline.predict(X_test)
# Predict probabilities
probs = model.predict_proba(X_test)
print(accuracy_score(y_test, y_predicted))
print("AUC ROC score: ", roc_auc_score(y_test, probs[:,1]))
# Obtain the results from the classification report and confusion matrix
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_true = y_test, y_pred = y_predicted))
Random Forest Classifier Model adjustments
# Import the Random Forest Classifier model from sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the model with balanced subsample
model = RandomForestClassifier(bootstrap=True,
                               class_weight={0: 1, 1: 12},  # 0: non-fraud, 1: fraud
                               criterion='entropy',
                               max_depth=10,          # change depth of model
                               min_samples_leaf=10,   # change the number of samples in leaf nodes
                               n_estimators=20,       # change the number of trees to use
                               n_jobs=-1,
                               random_state=5)
# Fit your training model to your training set
model.fit(X_train, y_train)
# Obtain the predicted values and probabilities from the model
y_predicted = model.predict(X_test)
# Calculate probs
probs = model.predict_proba(X_test)
# Calculate average precision
average_precision = average_precision_score(y_test, y_predicted)
# Obtain precision and recall
precision, recall, _ = precision_recall_curve(y_test, y_predicted)
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the roc auc score, the classification report and confusion matrix
print("auc roc score: ", roc_auc_score(y_test, probs[:,1]))
print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n', confusion_matrix(y_test, y_predicted))
The model results don’t improve drastically.
If we mostly care about catching fraud, and not so much about false positives, this option does not actually improve our model, though it is a simple one to try.
By smartly defining more options in the model, you can obtain better predictions.
GridSearchCV to find optimal parameters for Random Forest Classifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define the parameter sets to test
param_grid = {
'n_estimators': [1, 30],
'max_features': ['sqrt', 'log2'],  # 'auto' was removed in newer scikit-learn; 'sqrt' is the equivalent
'max_depth': [4, 8],
'criterion': ['gini', 'entropy']
}
# Define the model to use
model = RandomForestClassifier(random_state=5)
# Combine the parameter sets with the defined model
CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='recall', n_jobs=-1)
# Fit the model to our training data and obtain best parameters
CV_model.fit(X_train, y_train)
CV_model.best_params_
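Note that the fitted `GridSearchCV` object already holds a refit copy of the winning model in `best_estimator_`, so the parameters do not have to be retyped by hand. A self-contained sketch on synthetic data (the grid values here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_s, y_s = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
Xa, Xb, ya, yb = train_test_split(X_s, y_s, test_size=0.3, random_state=0)

grid = {'n_estimators': [10, 30], 'max_depth': [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=5), grid,
                      cv=3, scoring='recall', n_jobs=-1)
search.fit(Xa, ya)

# The refit best estimator is ready to predict -- no manual rebuild needed
best = search.best_estimator_
print(search.best_params_)
print('test accuracy:', best.score(Xb, yb))  # .score() is accuracy for classifiers
```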
Model results using GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve
# Build a RandomForestClassifier using the GridSearchCV parameters
model = RandomForestClassifier(bootstrap=True,
class_weight = {0:1,1:12},
criterion = 'entropy',
n_estimators = 30,
max_features = 'sqrt', # 'auto' is no longer accepted by newer scikit-learn
min_samples_leaf = 10,
max_depth = 8,
n_jobs = -1,
random_state = 5)
# Fit the model to your training data and get the predicted results
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
probs = model.predict_proba(X_test)
# Calculate average precision from the fraud-class probabilities
average_precision = average_precision_score(y_test, probs[:, 1])
# Obtain precision and recall over all probability thresholds
precision, recall, _ = precision_recall_curve(y_test, probs[:, 1])
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the roc_auc_score, classification report and confusion matrix
print('roc_auc_score:', roc_auc_score(y_test, probs[:, 1]))
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion_matrix:\n',confusion_matrix(y_test,y_predicted))
The model built with the GridSearchCV parameters still does not perform better.
Module 6: Voting Classifier
# Import modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Define the three classifiers to use in the ensemble
clf1 = LogisticRegression(class_weight={0:1,1:15},random_state=5)
clf2 = RandomForestClassifier(class_weight={0:1,1:12},
criterion='entropy',
max_depth=10,
max_features='sqrt', # 'auto' is no longer accepted by newer scikit-learn
min_samples_leaf=10,
n_estimators=20,
n_jobs=-1,
random_state=5)
clf3 = DecisionTreeClassifier(class_weight='balanced',random_state=5)
# Combine the classifiers in the ensemble model
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)], voting='hard')
# Fit the model to your training data and get the predicted results
ensemble_model.fit(X_train,y_train)
y_predicted = ensemble_model.predict(X_test)
# Print the classification report and confusion matrix of the model
# (hard voting has no predict_proba, so no ROC AUC here)
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predicted))
By combining the classifiers, you can take the best of multiple models.
Combining them here does indeed improve performance.
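Hard voting is simply a per-sample majority vote over the individual predictions. A minimal sketch of the mechanics with NumPy (the prediction arrays are made up for illustration, not model output):

```python
import numpy as np

# Predictions from three hypothetical classifiers for five transactions
preds = np.array([[0, 1, 1, 0, 1],   # logistic regression
                  [0, 1, 0, 0, 1],   # random forest
                  [1, 1, 1, 0, 0]])  # decision tree

# Majority vote: a sample is flagged when at least 2 of the 3 models say fraud
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)  # [0 1 1 0 1]
```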
# Adjust weights within the Voting Classifier
# Define the ensemble model
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)],
voting='soft',
weights=[1, 4, 1],
flatten_transform=True)
# Fit the model to your training data and get the predicted results
ensemble_model.fit(X_train,y_train)
y_predicted = ensemble_model.predict(X_test)
probs = ensemble_model.predict_proba(X_test)  # available because voting='soft'
# Calculate average precision from the fraud-class probabilities
average_precision = average_precision_score(y_test, probs[:, 1])
# Obtain precision and recall over all probability thresholds
precision, recall, _ = precision_recall_curve(y_test, probs[:, 1])
# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)
# Print the classification report and confusion matrix of the model
print('Classification report:\n',classification_report(y_test,y_predicted))
print('Confusion matrix:\n',confusion_matrix(y_test,y_predicted))
ensemble_model.estimators_
The weights option lets you tune the influence of the individual models and search for the best final mix for your fraud detection model.
Here, however, the weighted soft-voting ensemble does not improve performance.
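Under the hood, soft voting with `weights=[1, 4, 1]` takes a weighted average of the per-class probabilities before thresholding. A small sketch with made-up probabilities for a single transaction:

```python
import numpy as np

# Fraud probabilities from three hypothetical models for one transaction
p_lr, p_rf, p_dt = 0.30, 0.70, 0.20
weights = np.array([1, 4, 1])  # the random forest counts four times as much

# Weighted average: (0.30*1 + 0.70*4 + 0.20*1) / 6 = 0.55
p_fraud = np.average([p_lr, p_rf, p_dt], weights=weights)
print(round(p_fraud, 3))  # 0.55

# The ensemble flags fraud because the averaged probability crosses 0.5
predicted = int(p_fraud >= 0.5)
print(predicted)  # 1
```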
Module 7: KMeans Clustering
Prepare unlabeled train and test dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
# Split the data into train set and test set
train,test = train_test_split(df,test_size=0.3,random_state=0)
# Get the arrays of features and labels in train dataset
features_train = train.drop(['Time','Class'],axis=1)
features_train = features_train.values
labels_train = train['Class'].values
# Get the arrays of features and labels in test dataset
features_test = test.drop(['Time','Class'],axis=1)
features_test = features_test.values
labels_test = test['Class'].values
# Normalize the features in both train and test dataset
features_train = normalize(features_train)
features_test = normalize(features_test)
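Note that scikit-learn's `normalize` rescales each row (each transaction) to unit L2 norm, which is different from per-feature standardization. A quick check on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])
X_norm = normalize(X_demo)  # default: L2 norm per row

print(X_norm)                          # [[0.6 0.8], [1. 0.]]
print(np.linalg.norm(X_norm, axis=1))  # every row now has norm 1
```

Per-feature scaling (e.g. StandardScaler) is a common alternative preprocessing step for KMeans, since distances are then not dominated by any one feature.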
Build the model
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
model = KMeans(n_clusters=2,random_state=0)
model.fit(features_train)
labels_train_predicted = model.predict(features_train)
labels_test_predicted = model.predict(features_test)
# Decide if the model's cluster IDs are aligned with the true labels
true_negative,false_positive,false_negative,true_positive = confusion_matrix(labels_train,labels_train_predicted).ravel()
reassignflag = true_negative + true_positive < false_positive + false_negative
print(reassignflag)
# Flip the predicted labels only when the cluster IDs are swapped
if reassignflag:
    labels_test_predicted = 1 - labels_test_predicted
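The flip step exists because KMeans assigns arbitrary cluster IDs: cluster 0 may well end up being the fraud cluster. A minimal sketch of the alignment check on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true = np.array([0, 0, 0, 1, 1])
pred = np.array([1, 1, 1, 0, 0])  # cluster IDs are the mirror image of the labels

tn, fp, fn, tp = confusion_matrix(true, pred).ravel()
if tn + tp < fp + fn:  # more disagreement than agreement: IDs are swapped
    pred = 1 - pred
print(pred)  # [0 0 0 1 1]
```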
Model Evaluation
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score,f1_score
# Calculating confusion matrix for kmeans
print('Confusion Matrix:\n',confusion_matrix(labels_test,labels_test_predicted))
# Scoring kmeans
print('kmeans_precision_score:', precision_score(labels_test,labels_test_predicted))
print('kmeans_recall_score:', recall_score(labels_test,labels_test_predicted))
print('kmeans_accuracy_score:', accuracy_score(labels_test,labels_test_predicted))
print('kmeans_f1_score:',f1_score(labels_test,labels_test_predicted))
We can detect 91 out of 147 fraud cases in the test dataset.
However, there are 17,361 false positives, which indicates that the KMeans model needs better feature selection before it is practical.
Module 8: MiniBatchKMeans Clustering
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
# Split the data into train set and test set
train,test = train_test_split(df,test_size=0.3,random_state=0)
# Get the arrays of features and labels in train dataset
features_train = train.drop(['Time','Class'],axis=1)
features_train = features_train.values
labels_train = train['Class'].values
# Get the arrays of features and labels in test dataset
features_test = test.drop(['Time','Class'],axis=1)
features_test = features_test.values
labels_test = test['Class'].values
# Normalize the features in both train and test dataset
features_train = normalize(features_train)
features_test = normalize(features_test)
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import confusion_matrix
model = MiniBatchKMeans(n_clusters=2,random_state=0)
model.fit(features_train)
labels_train_predicted = model.predict(features_train)
labels_test_predicted = model.predict(features_test)
# Decide if the model's cluster IDs are aligned with the true labels
true_negative,false_positive,false_negative,true_positive = confusion_matrix(labels_train,labels_train_predicted).ravel()
reassignflag = true_negative + true_positive < false_positive + false_negative
print(reassignflag)
# Flip the predicted labels only when the cluster IDs are swapped, as with KMeans above
if reassignflag:
    labels_test_predicted = 1 - labels_test_predicted
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score,f1_score
# Calculating confusion matrix for MiniBatchKMeans
print('Confusion Matrix:\n',confusion_matrix(labels_test,labels_test_predicted))
# Scoring MiniBatchKMeans
print('minibatch_kmeans_precision_score:', precision_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_recall_score:', recall_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_accuracy_score:', accuracy_score(labels_test,labels_test_predicted))
print('minibatch_kmeans_f1_score:',f1_score(labels_test,labels_test_predicted))
We can detect 91 out of 147 fraud cases in the test dataset.
However, there are 17,341 false positives, which indicates that the MiniBatchKMeans model also needs better feature selection.
Module 9: Autoencoders
Prepare training data and testing data
First, drop the “Time” column (we will not use it) and apply scikit-learn’s StandardScaler to the “Amount” column. The scaler removes the mean and scales the values to unit variance.
The autoencoder setup differs from the supervised models above: we train the model on normal transactions only, so that fraudulent transactions later stand out through high reconstruction error.
Reserve 30% of the data for testing.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Make another copy of df and drop the unimportant "Time" feature
data = df.drop(['Time'], axis=1)
# Use scikit’s StandardScaler on the "Amount" feature
# The scaler removes the mean and scales the values to unit variance
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Create the training and testing sets
X1_train, X1_test = train_test_split(data, test_size=.3, random_state=0)
X1_train = X1_train[X1_train.Class == 0] # train the model on normal transactions
X1_train = X1_train.drop(['Class'], axis=1)
y1_test = X1_test['Class']
X1_test = X1_test.drop(['Class'], axis=1) #drop the class column
#transform to ndarray
X1_train = X1_train.values
X1_test = X1_test.values
X1_train.shape
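A quick check of what StandardScaler does to a single column (the mean goes to approximately 0 and the standard deviation to approximately 1), on a toy stand-in for the "Amount" column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

amounts = np.array([[1.0], [5.0], [10.0], [100.0]])  # toy "Amount" column
scaled = StandardScaler().fit_transform(amounts)

print(scaled.mean())  # ~0.0
print(scaled.std())   # ~1.0
```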
Build the Autoencoder Model
Use 4 fully connected layers with 14, 7, 7 and 29 neurons respectively for this Autoencoder model.
The first two layers are used for encoder, the last two go for the decoder.
Use L1 regularization during training
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras import regularizers
input_dim = X1_train.shape[1] #num of columns, 29
encoding_dim = 14
hidden_dim = int(encoding_dim / 2)
l1_reg = 1e-5 # L1 activity-regularization strength
input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim,
activation="tanh",
activity_regularizer=regularizers.l1(l1_reg))(input_layer)
encoder = Dense(hidden_dim, activation="relu")(encoder)
decoder = Dense(hidden_dim, activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
Train the Autoencoder Model
Train our model for 100 epochs with a batch size of 128 samples and save the best performing model to a file.
The ModelCheckpoint provided by Keras is really handy for such tasks.
Additionally, the training progress will be exported in a format that TensorBoard understands.
nb_epoch = 100
batch_size = 128
autoencoder.compile(metrics=['accuracy'],
loss='mean_squared_error',
optimizer='adam')
checkpointer = ModelCheckpoint(filepath='autoencoder_fraud.h5',
save_best_only=True,
verbose=0)
tensorboard = TensorBoard(log_dir='./logs',
histogram_freq=0,
write_graph=True,
write_images=True)
history = autoencoder.fit(X1_train, X1_train,
epochs=nb_epoch,
batch_size=batch_size,
shuffle=True,
validation_data=(X1_test, X1_test),
verbose=1,
callbacks=[checkpointer, tensorboard]).history
autoencoder = load_model('autoencoder_fraud.h5') # reload the best checkpointed model
Model Evaluation
The model seems to be performing well enough, although there is significant room for improvement by adding more hidden layers.
More hidden layers would allow this network to encode more complex relationships between the input features. The loss of our current model seems to be converging and more training epochs are not likely going to help.
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
Reconstruction Error
Autoencoders are trained to reduce reconstruction error.
predictions = autoencoder.predict(X1_test)
mse = np.mean(np.power(X1_test - predictions, 2), axis=1)
df_error = pd.DataFrame({'reconstruction_error': mse,
'true_class': y1_test})
df_error.describe()
ROC Curve
Since the dataset is highly imbalanced, the Receiver Operating Characteristic (ROC) curve is not very informative here, even though it is the standard output of most binary classifiers:
a model that simply predicts the non-fraud class for every transaction can still produce a decent-looking curve. Let’s have a look at the ROC curve first anyway.
# Import modules
from sklearn.metrics import auc, roc_curve,precision_recall_curve
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import recall_score,f1_score,precision_recall_fscore_support
false_positive_rate, true_positive_rate, thresholds = roc_curve(df_error.true_class, df_error.reconstruction_error)
roc_auc = auc(false_positive_rate, true_positive_rate)
# Plot the roc curve
plot_roc_curve(false_positive_rate,true_positive_rate,roc_auc)
The ROC curve plots the true positive rate versus the false positive rate, over different threshold values.
We basically want the blue line to be as close as possible to the upper left corner.
As our dataset is quite imbalanced, ROC doesn’t look very useful for us even though the results look pretty good.
Recall vs. Precision
Considering the imbalance of our dataset, we take a look at the Recall vs. Precision trade off.
precision, recall, thresholds = precision_recall_curve(df_error.true_class, df_error.reconstruction_error)
# Plot recall precision tradeoff
plt.plot(recall, precision, linewidth=5, label='Precision-Recall curve')
plt.title('Recall vs Precision')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
# Plot precision and recall for different thresholds
plt.plot(thresholds, precision[1:], label="Precision",linewidth=5)
plt.plot(thresholds, recall[1:], label="Recall",linewidth=5)
plt.title('Precision and recall for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.legend()
plt.show()
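One way to turn these curves into a concrete operating point is to pick the lowest threshold whose precision meets a target. A sketch on made-up arrays shaped like the output of `precision_recall_curve` (the 0.8 target is an assumption for illustration, not from the notebook):

```python
import numpy as np

# Hypothetical precision_recall_curve output: precision/recall have
# one more element than thresholds, and precision[i + 1] pairs with thresholds[i]
thresholds = np.array([1.0, 2.0, 3.0, 4.0])
precision = np.array([0.10, 0.40, 0.85, 0.95, 1.00])
recall = np.array([1.00, 0.90, 0.70, 0.40, 0.00])

target_precision = 0.8
# First threshold index whose precision reaches the target
idx = np.argmax(precision[1:] >= target_precision)
print('threshold:', thresholds[idx])  # 2.0 (precision 0.85, recall 0.70)
```

Note that `argmax` returns 0 when no entry meets the target, so a real implementation should check that the target is actually attainable first.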
Prediction
To predict whether a new transaction is normal or fraudulent, we calculate the reconstruction error for the transaction itself.
If the error exceeds a predefined threshold, we mark it as fraud (since the model should reconstruct normal transactions with low error).
# Set a threshold
set_threshold = 5
groups = df_error.groupby('true_class')
fig, ax = plt.subplots()
for name, group in groups:
ax.plot(group.index,
group.reconstruction_error,
marker='o',
ms=3.5,
linestyle='',
label= "Fraud" if name == 1 else "Nonfraud")
ax.hlines(set_threshold,
ax.get_xlim()[0],
ax.get_xlim()[1],
colors="r",
zorder=100,
label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show()
Confusion Matrix
y_pred = [1 if e > set_threshold else 0 for e in df_error.reconstruction_error.values]
print('Confusion matrix:\n',confusion_matrix(df_error.true_class, y_pred))