AI-Supported Credit Card Fraud Detection

Our process

Use Machine Learning to Identify a Fraudulent Bank Transaction

Understand our Dataset

With Python, we used Pandas and visualization libraries to conduct an Exploratory Data Analysis (EDA) to find meaningful relationships in our data.

Clean our Dataset and Find Correlations

After performing our EDA, we distilled our data set to six features and one target. Depending on the current sample, rows were removed because they were duplicates or contained NaN values.
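The cleaning step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the rows are made up, and the `step` and `nameOrig` columns and the transaction type values are assumptions standing in for whatever extra columns the raw file contains.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real transactions (all values are made up;
# "step" and "nameOrig" are hypothetical extra columns).
raw = pd.DataFrame({
    "step": [1, 1, 1, 2],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [100.0, 100.0, 250.0, np.nan],
    "nameOrig": ["C1", "C1", "C2", "C3"],
    "oldbalanceOrg": [500.0, 500.0, 250.0, 80.0],
    "newbalanceOrig": [400.0, 400.0, 0.0, 80.0],
    "oldbalanceDest": [0.0, 0.0, 10.0, 5.0],
    "newbalanceDest": [100.0, 100.0, 260.0, 5.0],
    "isFraud": [0, 0, 1, 0],
})

FEATURES = ["amount", "oldbalanceOrg", "newbalanceOrig",
            "oldbalanceDest", "newbalanceDest", "type"]
TARGET = "isFraud"


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the six features and the target, then drop duplicate and NaN rows."""
    kept = df[FEATURES + [TARGET]]
    kept = kept.drop_duplicates().dropna()
    return kept.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Selecting the columns before calling `drop_duplicates` means two rows count as duplicates when they agree on the features and the target, even if a dropped column differed.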

Code Machine Learning Models and Produce Conclusions

We split our data set and trained ten different classification models supported by the scikit-learn library, each of which learned to predict fraud with high accuracy.
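A rough sketch of the split-and-train loop, under several assumptions: the data is synthetic, the labelling rule is invented so the toy problem is learnable, the split is stratified 75/25, and the three classifiers are only examples, not the ten the team trained.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data (not the project's data set).
df = pd.DataFrame({
    "amount": rng.uniform(1, 1000, n),
    "oldbalanceOrg": rng.uniform(0, 5000, n),
    "type": rng.choice(["PAYMENT", "TRANSFER"], n),
})
# Made-up labelling rule: large transfers are marked as fraud.
df["isFraud"] = ((df["amount"] > 900) & (df["type"] == "TRANSFER")).astype(int)

# One-hot encode the categorical "type" column so every feature is numeric.
X = pd.get_dummies(df.drop(columns="isFraud"), columns=["type"], dtype=float)
y = df["isFraud"]

# stratify=y keeps the rare fraud class at the same rate in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out rows

print(scores)
```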

Size of Original Data Set

Models

Hours Of Work

Hard Workers

Features

Features Used for Predictions

amount

oldbalanceOrg

newbalanceOrig

oldbalanceDest

newbalanceDest

type

Libraries used for conducting Exploratory Data Analysis

Pandas
Plotly

With Pandas, Python's data analysis library, we read our dataset into a data frame with the read_csv function. Afterwards, we reduced the data frame to hold only the features relevant to training our models.

Syntax

import pandas as pd

Methods/Class Variables:

pd.read_csv()
df.head()
df.info()
df.drop()
df.dropna()
df.reset_index()
df.columns
df.dtypes
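A short sketch of how the reading and inspection calls listed above fit together. The CSV text is a made-up stand-in read from memory so the example is self-contained, and the `step` and `nameOrig` columns are hypothetical.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are made up).
csv_text = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,isFraud
1,PAYMENT,100.0,C1,500.0,400.0,0
1,TRANSFER,250.0,C2,250.0,0.0,1
2,CASH_OUT,,C3,80.0,80.0,0
"""

# With a real file this would be pd.read_csv("some_file.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())            # first rows of the data frame
df.info()                   # column names, non-null counts, and dtypes
print(df.dtypes)            # dtype of each column
print(df.columns.tolist())  # column labels

# Drop columns that are not used as features (hypothetical names).
features = df.drop(columns=["step", "nameOrig"])
print(features.columns.tolist())
```

The empty `amount` field in the last row is read as NaN, which is what `df.dropna()` would later remove.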

Plotly Express allows us to create multiple visualizations representing different columns in our dataset.

Syntax

import plotly.express as pt

Methods/Class Variables:

pt.imshow()
pt.histogram()
pt.pie()
pt.scatter()
pt.density_heatmap()
pt.box()

Visualizations

What do you see in our data?


Scatter Plot of Before and After Balance

Scatter Plot for before and after balance of the sender

Scatter Plot of Before and After Balance Based on Transaction Type

Scatter Plot for before and after balance of the sender based on transaction type

Correlation Heatmap

Correlation heatmap before removing unwanted columns

Scatter Plot for Finding Fraud

Scatter plot of payment type against whether the transaction is fraudulent

Density Heatmap of isFraud

A density heatmap of the isFraud column

Histogram of Transaction Types

A histogram representing all the types of transactions made in the data set

Pie Chart of Fraud Cases vs Payment Types

A pie chart of types of payment related to all Fraud cases

Pie Chart of Transaction Types

A pie chart of all the types of transactions

Box Plot of Transaction Amounts

A box plot of the amounts of all the transactions

Scatter Matrix of Amount for Destination

A scatter matrix of the amount, oldbalanceDest, and newbalanceDest.

Machine Learning Models

The following are the models we trained on our dataset and used to make predictions.

Support Vector Machines

A Support Vector Machine (SVM) is a classifier that treats each data point as a point in a multidimensional space and looks for a hyperplane that separates the classes. The objective is to choose the separating hyperplane with the largest margin, that is, the greatest distance to the nearest points of each class.
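To make the hyperplane and margin concrete, here is a small sketch on six made-up, linearly separable points. It assumes a linear kernel and a very large `C` to approximate a hard margin; neither setting is taken from the project.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (made-up points).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]             # offset of the hyperplane
margin = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("hyperplane normal:", w, "offset:", b)
print("margin width:", margin)
print("support vectors:", clf.support_vectors_)
```

Only the support vectors, the points closest to the boundary, determine where the hyperplane lies; the other points could move without changing it.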

K-Nearest Neighbors

K-NN is a supervised Machine Learning algorithm that predicts the label of a new data point from the labels of the most similar (nearest) points in the available data. It can be used for both regression and classification.
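The idea can be sketched from scratch in a few lines. This assumes Euclidean distance and a simple majority vote, with made-up points, and is not how scikit-learn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up training points: three small transactions (0) and two large ones (1).
X_train = np.array([[100.0, 0.0], [120.0, 10.0], [90.0, 5.0],
                    [9000.0, 0.0], [9500.0, 0.0]])
y_train = np.array([0, 0, 0, 1, 1])

small = knn_predict(X_train, y_train, np.array([110.0, 3.0]))
large = knn_predict(X_train, y_train, np.array([9200.0, 0.0]))
print(small, large)
```

Because the vote depends on raw distances, features on very different scales are usually standardized before using K-NN.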

Random Forest Classification

Random Forest builds many decision trees, each on a randomly selected sample of the data, collects a prediction from each tree, and then selects the prediction with the most votes (occurrences) as the final prediction.
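The sample-then-vote procedure can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data with an invented labelling rule, not the library's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: the label is 1 when the first feature exceeds 0.5 (made-up rule).
X = rng.uniform(0, 1, size=(200, 3))
y = (X[:, 0] > 0.5).astype(int)

trees = []
for seed in range(25):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed)
    trees.append(tree.fit(X[rows], y[rows]))


def forest_predict(samples):
    """Collect one prediction per tree and return the majority vote per sample."""
    votes = np.stack([tree.predict(samples) for tree in trees])  # (n_trees, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)                # majority of 0/1 votes


new = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]])
print(forest_predict(new))
```

scikit-learn's `RandomForestClassifier` differs in two ways: it also considers a random subset of features at each split, and it averages the trees' class probabilities rather than counting hard votes.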

XGBoost

XGBoost (Extreme Gradient Boosting) is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is widely used for regression, classification, and ranking problems.

Logistic Regression

A logistic regression model predicts the probability of a binary outcome (the dependent variable) from one or more independent variables.
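A small sketch of how such a model turns its inputs into a probability. The weights and bias are hypothetical values chosen for illustration; in practice they are fitted to training data.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, weights, bias):
    """P(y = 1 | x) = sigmoid(w . x + b), computed for each row of X."""
    return sigmoid(X @ weights + bias)


# Hypothetical coefficients (not fitted to the project's data).
weights = np.array([2.0, -1.0])
bias = -0.5

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
probs = predict_proba(X, weights, bias)
labels = (probs >= 0.5).astype(int)  # predict class 1 when the probability reaches 0.5

print(probs)
print(labels)
```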

Naive Bayes

Naive Bayes is a classification model that calculates the prior probability of each class label and the likelihood of each attribute value given that class. It then combines these values using Bayes' formula, treating the attributes as independent of one another, and chooses the class with the highest resulting probability.
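The calculation can be worked through by hand on a toy table. The rows and the "high"/"low" amount buckets are made up, and the sketch uses raw counts with no smoothing.

```python
from collections import Counter

# Made-up rows: (transaction type, amount bucket, label), where label 1 = fraud.
rows = [
    ("TRANSFER", "high", 1), ("TRANSFER", "high", 1),
    ("CASH_OUT", "high", 1), ("TRANSFER", "low", 1),
    ("PAYMENT", "low", 0), ("PAYMENT", "low", 0), ("PAYMENT", "high", 0),
    ("TRANSFER", "low", 0), ("CASH_OUT", "low", 0), ("PAYMENT", "low", 0),
]


def naive_bayes_posteriors(rows, query):
    """Return P(class | query) for each class, assuming independent attributes."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for cls, count in class_counts.items():
        score = count / len(rows)  # prior P(class)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == cls and row[i] == value)
            score *= matches / count  # likelihood P(attribute_i = value | class)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


posteriors = naive_bayes_posteriors(rows, ("TRANSFER", "high"))
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

With raw counts, an attribute value never seen with a class drives that class's probability to zero, which is why practical implementations usually add smoothing.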

Conclusions

What we got out of each model:

Overall Result

In general, many of these models yielded very similar results, with accuracies so high that the differences between them were negligible.

Support Vector Machine Learning

The Support Vector Machine model has an accuracy score of 0.99939984 and a roc_auc_score of 0.666666. Taking fraud as the positive class, the model yielded 19,977 true negatives, 0 false positives, 6 true positives, and 12 false negatives.

K-Nearest Neighbors

K-Nearest-Neighbors has an accuracy score of 0.9994498624656164 and a roc_auc_score of 0.69444444. Taking fraud as the positive class, the model yielded 19,977 true negatives, 0 false positives, 7 true positives, and 11 false negatives.

XGBoost

XGBoost has an accuracy score of 0.9998258345428157 and a roc_auc_score of 0.9285714285714286. Taking fraud as the positive class, the model yielded 17,204 true negatives, 0 false positives, 18 true positives, and 3 false negatives.

Random Forest Classification

The Random Forest model has an accuracy score of around 0.999899 and a roc_auc_score of 0.972197. Taking fraud as the positive class, the model yielded 19,976 true negatives, 1 false positive, 17 true positives, and 1 false negative.
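The gap between the very high accuracies and the lower roc_auc_score values comes from how few fraud cases the test sets contain. The sketch below recomputes both scores from the K-Nearest-Neighbors counts, under two assumptions: fraud is the positive class (so the 19,977 correct majority-class predictions are legitimate transactions, 7 frauds were caught, and 11 were missed), and the roc_auc_score was computed on hard 0/1 predictions rather than probabilities.

```python
def accuracy(tn, fp, fn, tp):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)


def roc_auc_from_hard_labels(tn, fp, fn, tp):
    """For 0/1 predictions (no scores), ROC AUC is the mean of recall and specificity."""
    recall = tp / (tp + fn)       # share of fraud cases that were caught
    specificity = tn / (tn + fp)  # share of legitimate cases left alone
    return (recall + specificity) / 2


# K-Nearest-Neighbors counts, reading fraud as the positive class (assumption).
knn = dict(tn=19977, fp=0, fn=11, tp=7)

print(accuracy(**knn))
print(roc_auc_from_hard_labels(**knn))
```

Accuracy is dominated by the legitimate transactions, so a model can miss most fraud cases and still score above 0.999, while the roc_auc_score weighs both classes equally and drops sharply.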

Team

Our hard-working team

Asad Shahid

Technical Expert and GitHub Enjoyer

No Pressure, No Diamonds

Engoron Lim

Decision-Maker and Debugger

Corgi is love. Corgi is life.

Jayden Cavanagh

Graphic Designer and Project Manager

Avogadro's number 6.02214076 x 10²³. Now why is the sky blue?

Ray Parikh

Frontend Developer and Design Lover

An apple a day keeps anyone away if you throw it hard enough.

Shruti Vora

Data Dissection

Purpose Fuels Passion

William Ruan

Communications and Organizer

They said to follow my dreams, so I went back to sleep.

Katherine Vo

Instructor and Cal Academic Advisor

A Weighted Blanket is not enough, I need to be compressed into a .zip file

AI-Supported Credit Card Fraud Detection

By the Parallel Processing team - AI Camp Data Science Batch B - 2022
