Part I - (Exploration of Loan Data from Prosper)¶

by (Kevin Egunza)¶

Introduction¶

Loan Data from Prosper

  • This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

Preliminary Wrangling¶

In [4]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [5]:
df_loanData = pd.read_csv('prosperLoanData.csv')
In [6]:
df_loanData.head()
Out[6]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [7]:
df_loanData['Term'].value_counts()
Out[7]:
36    87778
60    24545
12     1614
Name: Term, dtype: int64
  • 871 records in the dataset have been duplicated
  • ListingCreationDate and ClosedDate have incorrect data type
  • Term variable is more of a category data type than a numerical type
  • LoanStatus variable is more of a category data type than a object type representing a string
  • ListingCategory (numeric) is more of a category data type than a numerical type
In [8]:
#Check for duplicated records
df_loanData[df_loanData['ListingNumber'].duplicated() & df_loanData['ListingKey'].duplicated()]['ListingNumber'].size
Out[8]:
871
In [9]:
df_loanData.shape
Out[9]:
(113937, 81)

Data Cleaning¶

In [10]:
# Make a copy of the original dataframe for cleaning purposes
df_copy = df_loanData.copy()

Issue 1:¶

ListingCreationDate and ClosedDate have incorrect data type

Define:¶

  • Convert data types of ListingCreationDate and ClosedDate to datetime

Code¶

In [11]:
df_copy['ListingCreationDate'] = pd.to_datetime(df_copy['ListingCreationDate'])
df_copy['ClosedDate'] = pd.to_datetime(df_copy['ClosedDate'])

Test¶

In [12]:
df_copy[['ListingCreationDate', 'ClosedDate']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 2 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   ListingCreationDate  113937 non-null  datetime64[ns]
 1   ClosedDate           55089 non-null   datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 1.7 MB

Issue 2:¶

871 records in the dataset have been duplicated

Define:¶

  • Delete the 871 duplicated records

Code¶

In [13]:
index = df_copy[df_copy['ListingNumber'].duplicated()].index
df_copy.drop(index, axis='rows', inplace=True)

Test¶

In [14]:
df_copy[df_copy['ListingNumber'].duplicated()]
Out[14]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors

0 rows × 81 columns

Issue 3:¶

Term variable is more of a category data type than a numerical type

Define:¶

  • Convert the Term column data type to ordered category data type

Code¶

In [15]:
term_type = pd.CategoricalDtype(categories=[12, 36, 60], ordered=True)
df_copy['Term'] = df_copy['Term'].astype(term_type)

Test¶

In [16]:
df_copy['Term'].info()
<class 'pandas.core.series.Series'>
Int64Index: 113066 entries, 0 to 113936
Series name: Term
Non-Null Count   Dtype   
--------------   -----   
113066 non-null  category
dtypes: category(1)
memory usage: 993.9 KB

Issue 4:¶

LoanStatus variable is more of a category data type than a object type representing a string

Define:¶

  • Convert the LoanStatus column data type to category data type

Code¶

In [17]:
df_copy['LoanStatus']= df_copy['LoanStatus'].astype("category")

Test¶

In [18]:
df_copy['LoanStatus'].info()
<class 'pandas.core.series.Series'>
Int64Index: 113066 entries, 0 to 113936
Series name: LoanStatus
Non-Null Count   Dtype   
--------------   -----   
113066 non-null  category
dtypes: category(1)
memory usage: 994.1 KB

Issue 5:¶

ListingCategory (numeric) variable is more of a category data type than a object type representing a string

Define:¶

  • Convert the ListingCategory (numeric) column data type to category data type

Code¶

In [19]:
df_copy['ListingCategory (numeric)'] = df_copy['ListingCategory (numeric)'].astype('category')

Test¶

In [20]:
df_copy['ListingCategory (numeric)'].info()
<class 'pandas.core.series.Series'>
Int64Index: 113066 entries, 0 to 113936
Series name: ListingCategory (numeric)
Non-Null Count   Dtype   
--------------   -----   
113066 non-null  category
dtypes: category(1)
memory usage: 994.5 KB
In [22]:
df_copy['ProsperRating (numeric)'].info()
<class 'pandas.core.series.Series'>
Int64Index: 113066 entries, 0 to 113936
Series name: ProsperRating (numeric)
Non-Null Count  Dtype   
--------------  -----   
83982 non-null  category
dtypes: category(1)
memory usage: 994.1 KB

Issue 6:¶

ProsperRating (numeric) variable is more of a category data type than a float64.

Define:¶

  • Convert the ProsperRating (numeric) column data type to category data type

Code¶

In [23]:
def convert_prosperRating():
    """Convert the ProsperRating (numeric) column data type to ordered category data type"""
    prosper_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7], ordered=True)
    df_copy['ProsperRating (numeric)'] = df_copy['ProsperRating (numeric)'].astype(prosper_type)
In [24]:
#Convert the ProsperRating (numeric) column data type to ordered category data type
convert_prosperRating()

Test¶

In [24]:
df_copy['ProsperRating (numeric)'].info()
<class 'pandas.core.series.Series'>
Int64Index: 113066 entries, 0 to 113936
Series name: ProsperRating (numeric)
Non-Null Count  Dtype   
--------------  -----   
83982 non-null  category
dtypes: category(1)
memory usage: 994.1 KB
In [25]:
#Store cleaned data in csv file
df_copy.to_csv('clean_data.csv')

The structure of the dataset¶

The prosperLoanData Dataset has 113,937 rows and 81 columns.

The main feature of interest in the dataset¶

'LoanOriginalAmount'

  • Are there differences between loans depending on how large the original loan amount was?

Features in the dataset that will help support the investigation into the feature of interest¶

'Term', 'LoanStatus', 'BorrowerRate', 'BorrowerState', 'EmploymentStatus', 'LoanOriginalAmount', 'CreditScoreRangeLower', 'DebtToIncomeRatio', 'IncomeRange', 'AmountDelinquent', 'AvailableBankcardCredit'

Univariate Exploration¶

What is the distribution of the LoanOriginalAmount feature?¶

In [26]:
#plot a histogram to showcase the distribution of LoanOriginalAmount
bins = np.arange(0, df_copy['LoanOriginalAmount'].max()+10, 2000)
plt.hist(data=df_copy, x='LoanOriginalAmount', bins=bins);
plt.title('LoanOriginalAmount Histogram')
plt.xlabel('LoanOriginalAmount')
plt.ylabel('Frequency');
  • The distribution of the LoanOriginalAmount is right skewed with most loan amount between 2000 and 15000.

What is the distribution of the BorrowerRate feature?¶

In [27]:
#plot a histogram to showcase the distribution of BorrowerRate
bins = np.arange(0, df_copy['BorrowerRate'].max()+0.5, 0.05)
plt.hist(data=df_copy, x='BorrowerRate');
plt.title('BorrowerRate Histogram')
plt.xlabel('BorrowerRate')
plt.ylabel('Frequency');
  • The distribution of the BorrowerRate feature is right skewed with most rates between 0.1 and 0.3.

What is the distribution of the AmountDelinquent feature?¶

In [28]:
#plot a histogram to showcase the distribution of AmountDelinquent
plt.hist(data=df_copy, x='AmountDelinquent');
plt.title('AmountDelinquent Histogram')
plt.xlabel('AmountDelinquent')
plt.ylabel('Frequency');
In [41]:
print(len(df_copy))
len(df_copy[df_copy['AmountDelinquent']>100000])
113066
Out[41]:
112
  • The distribution of the AmountDelinquent feature is right skewed. The histogram also depict presence of outlier AmountDelinquent values. Out of 113066 values, 112 AmountDelinquent values range above 100000.
In [32]:
#plot a histogram to showcase the distribution of AmountDelinquent

bins = np.arange(0, df_copy['AmountDelinquent'].max()+5, 10000)
plt.hist(data=df_copy, x='AmountDelinquent',bins=bins)
plt.xlim(0, 100000)
plt.title('AmountDelinquent Histogram')
plt.xlabel('AmountDelinquent')
plt.ylabel('Frequency');
  • Zooming in on the distribution of the AmountDelinquent feature depicts that majority of the data points are between 0 and 20,000.

What is the distribution of the AvailableBankcardCredit feature?¶

In [23]:
#plot a histogram to showcase the distribution of AvailableBankcardCredit
bins = np.arange(0, df_copy['AvailableBankcardCredit'].max()+0.5, 0.05)
plt.hist(data=df_copy, x='AvailableBankcardCredit')
plt.title('AvailableBankcardCredit Histogram')
plt.xlabel('AvailableBankcardCredit')
plt.ylabel('Frequency');
In [39]:
print(len(df_copy))
len(df_copy[df_copy['AvailableBankcardCredit']>200000])
113066
Out[39]:
89
  • The distribution of the AvailableBankcardCredit feature is right skewed. The histogram also depict presence of outlier AvailableBankcardCredit values. Out of 113066 values, 89 AvailableBankcardCredit values range above 200000.
In [46]:
#plot a histogram to showcase the distribution of AvailableBankcardCredit
bins = np.arange(0, df_copy['AvailableBankcardCredit'].max(), 20000)
plt.hist(data=df_copy, x='AvailableBankcardCredit')
plt.xlim(0, 200000)
plt.title('AvailableBankcardCredit Histogram')
plt.xlabel('AvailableBankcardCredit')
plt.ylabel('Frequency');
  • Zooming in on the distribution of the AvailableBankcardCredit feature depicts that majority of the data points are between 0 and 60,000

What is the distribution of the CreditScoreRangeLower feature?¶

In [15]:
#plot a histogram to showcase the distribution of CreditScoreRangeLower
bins = np.arange(0, df_copy['CreditScoreRangeLower'].max()+10, 2000)
plt.hist(data=df_copy, x='CreditScoreRangeLower');
plt.title('CreditScoreRangeLower Histogram')
plt.xlabel('CreditScoreRangeLower')
plt.ylabel('Frequency');
  • The CreditScoreRangeLower feature has a bimodal distribution with most data points between 600 and 800. The histogram also depicts presence of outlier values between 0 and 100.

What is the distribution of the DebtToIncomeRatio feature?¶

In [24]:
#plot a histogram to showcase the distribution of DebtToIncomeRatio
bins = np.arange(0, df_copy['DebtToIncomeRatio'].max()+10, 2000)
plt.hist(data=df_copy, x='DebtToIncomeRatio');
plt.title('DebtToIncomeRatio Histogram')
plt.xlabel('DebtToIncomeRatio')
plt.ylabel('Frequency');
  • The distribution of the DebtToIncomeRatio feature is right skewed. The histogram also depict presence of outlier DebtToIncomeRatio values between 9 and 10.
In [52]:
#plot a histogram to showcase the distribution of DebtToIncomeRatio
bins = np.arange(0, df_copy['DebtToIncomeRatio'].max()+10, 2000)
plt.hist(data=df_copy, x='DebtToIncomeRatio')
plt.xlim(0, 2)
plt.title('DebtToIncomeRatio Histogram')
plt.xlabel('DebtToIncomeRatio')
plt.ylabel('Frequency');
  • Zooming in on the distribution of the DebtToIncomeRatio feature depicts that majority of the data points are between 0 and 1.

What is the distribution of the Term feature?¶

In [38]:
#plot a pie chart to showcase the distribution of Term of the loan
sorted_counts = df_copy['Term'].value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90, counterclock = False);
plt.axis('square')
plt.title('A pie chart showcasing the distribution of the loan Term');
  • Majority of the loans i.e more than 3/4 have a loan term of 36 months. The 12 months loan term category had the least number of loans with 1614 out of 113066.
In [48]:
print(len(df_copy))
df_copy['Term'].value_counts()
113066
Out[48]:
36    87224
60    24228
12     1614
Name: Term, dtype: int64

What is the distribution of the IncomeRange feature?¶

In [25]:
df_copy['IncomeRange'].unique()
Out[25]:
array(['$25,000-49,999', '$50,000-74,999', 'Not displayed', '$100,000+',
       '$75,000-99,999', '$1-24,999', 'Not employed', '$0'], dtype=object)
In [34]:
#Convert the IncomeRange column data type to ordered category data type
income_type = pd.CategoricalDtype(categories=['$0', 'Not employed', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+', 'Not displayed'], ordered=True)
df_copy['IncomeRange'] = df_copy['IncomeRange'].astype(income_type)
#plot a bar plot to showcase the distribution of IncomeRange
base_color = sns.color_palette()[0]
sns.countplot(data=df_copy, x='IncomeRange', color=base_color)
plt.xticks(rotation=90)
plt.ylabel('Frequency')
plt.title('A bar plot of IncomeRange distribution');
  • The $25,000-49,999 and $50,000-74,999 IncomeRange categories had the highest distributions whereas $0, and Not employed IncomeRange categories had the least distributions.

What is the distribution of the LoanStatus feature?¶

In [32]:
#plot a bar plot to showcase the distribution of LoanStatus
sns.countplot(data=df_copy, x="LoanStatus", color=base_color)
plt.xticks(rotation=90)
plt.title('A bar plot of LoanStatus distribution');
  • In regards to the LoanStatus feature most loans in this dataset are either current or completed with a few either chargedoff, defaulted, in final payment, past due or cancelled.
In [22]:
df_copy[df_copy['LoanStatus'] != 'Current']['LoanStatus'].value_counts()
Out[22]:
Completed                 38061
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        800
Past Due (31-60 days)       361
Past Due (61-90 days)       311
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      203
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64

What is the distribution of the EmploymentStatus feature?¶

In [35]:
#plot a bar plot to showcase the distribution of EmploymentStatus
sns.countplot(data=df_copy, x="EmploymentStatus", color=base_color)
plt.xticks(rotation=90)
plt.title('A bar plot of EmploymentStatus distribution');
  • In the EmploymentStatus feature, the Employed category has significant higher distribution compared to other EmploymentStatus categories followed by the Full-time category. The Not-employed, Part-time and Retired categories had the least distributions.

What is the distribution of the ProsperRating (numeric) feature?¶

In [36]:
#Convert the ProsperRating (numeric) column data type to ordered category data type
prosper_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7], ordered=True)
df_copy['ProsperRating (numeric)'] = df_copy['ProsperRating (numeric)'].astype(prosper_type)
#plot a bar plot to showcase the distribution of ProsperRating (numeric)
sns.countplot(data=df_copy, x="ProsperRating (numeric)", color=base_color)
plt.title('A bar plot of ProsperRating (numeric) distribution');
  • The distribution of ProsperRating (numeric) feature increases from the first and least rating to the middle rating and subsequently decreases from the miiddle to last and highest rating.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

  • The distribution of the LoanOriginalAmount is right skewed with most loan amount between 2000 and 15000.
  • The distribution of the BorrowerRate feature is right skewed with most rates between 0.1 and 0.3.
  • The distribution of the AmountDelinquent feature is right skewed. The histogram also depicts presence of outlier AmountDelinquent values. Out of 113066 values, 112 AmountDelinquent values range above 100000.
  • Zooming in on the distribution of the AmountDelinquent feature depicts that majority of the data points are between 0 and 20,000.
  • The distribution of the AvailableBankcardCredit feature is right skewed. The histogram also depict presence of outlier AvailableBankcardCredit values. Out of 113066 values, 89 AvailableBankcardCredit values range above 200000.
  • Zooming in on the distribution of the AvailableBankcardCredit feature depicts that majority of the data points are between 0 and 60,000.
  • The CreditScoreRangeLower feature has a bimodal distribution with most data points between 600 and 800. The histogram also depicts presence of outlier values between 0 and 100.
  • The distribution of the DebtToIncomeRatio feature is right skewed. The histogram also depict presence of outlier DebtToIncomeRatio values between 9 and 10.
  • Zooming in on the distribution of the DebtToIncomeRatio feature depicts that majority of the data points are between 0 and 1.
  • Majority of the loans i.e more than 3/4 have a loan term of 36 months. The 12 months loan term category had the least number of loans with 1614 out of 113066.
  • The $25,000-49,999 and $50,000-74,999 IncomeRange categories had the highest distributions whereas $0, and Not employed IncomeRange categories had the least distributions.
  • In regards to the LoanStatus feature most loans in this dataset are either current or completed with a few either chargedoff, defaulted, in final payment, past due or cancelled.
  • In the EmploymentStatus feature, the Employed category has significant higher distribution compared to other EmploymentStatus categories followed by the Full-time category. The Not-employed, Part-time and Retired categories had the least distributions.
  • The distribution of ProsperRating (numeric) feature increases from the first and least rating to the middle rating and subsequently decreases from the miiddle to last and highest rating.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

  • The AmountDelinquent feature histogram depicts presence of outlier AmountDelinquent values. Out of 113066 values, 112 AmountDelinquent values range above 100000. Zooming in on the distribution of the AmountDelinquent feature depicts that majority of the data points are between 0 and 20,000.
  • The AvailableBankcardCredit feature histogram depicts presence of outlier AvailableBankcardCredit values. Out of 113066 values, 89 AvailableBankcardCredit values range above 200000. Zooming in on the distribution of the AvailableBankcardCredit feature depicts that majority of the data points are between 0 and 60,000.
  • ProsperRating (numeric) variable is more of a category data type than a float64. therefore I converted the ProsperRating (numeric) column data type and plotted it as a category data type

Bivariate Exploration¶

How does the LoanOriginalAmount feature relate to BorrowerRate feature?¶

In [57]:
#plot a scatter plot of LoanOriginalAmount against BorrowerRate
plt.figure(figsize=(5,5))
plt.scatter(data=df_copy, x='BorrowerRate', y='LoanOriginalAmount', alpha=1/10)
plt.xlabel('BorrowerRate')
plt.ylabel('LoanOriginalAmount')
plt.title('Scatter Plot of BorrowerRate against LoanOriginalAmount');
  • In regards to the relation betweeen LoanOriginalAmount and BorrowerRate, few Loan Amounts of 3000 and below had BorrowerRate of greater than 0.4. Loan Amounts of above 25000 had BorrowerRate of below 0.2.

How does the LoanOriginalAmount feature relate to CreditScoreRangeLower feature?¶

In [56]:
#plot a scatter plot of LoanOriginalAmount against CreditScoreRangeLower
plt.figure(figsize=(5,5))
plt.scatter(data=df_copy, x='CreditScoreRangeLower', y='LoanOriginalAmount', alpha=1/10)
plt.xlim(300, 900)
plt.xlabel('CreditScoreRangeLower')
plt.ylabel('LoanOriginalAmount')
plt.title('Scatter Plot of CreditScoreRangeLower against LoanOriginalAmount');
  • Considering the relation between LoanOriginalAmount and CreditScoreRangeLower, most loan amounts CreditScoreRangeLower lie between 360 and 880. Loan Amounts of above 25000 associate with scores of 720 and above. A few Loan amounts have CreditScoreRangeLower of 0.

How does the LoanOriginalAmount feature relate to DebtToIncomeRatio feature?¶

In [55]:
#plot a scatter plot of LoanOriginalAmount against DebtToIncomeRatio
#plt.figure(figsize=(5,5))
plt.scatter(data=df_copy, x='DebtToIncomeRatio', y='LoanOriginalAmount', alpha=1/10)
plt.xlabel('DebtToIncomeRatio')
plt.ylabel('LoanOriginalAmount')
plt.title('Scatter Plot of DebtToIncomeRatio against LoanOriginalAmount');
  • Relating DebtToIncomeRatio to LoanOriginalAmount, we observe that loan amounts above 25000 have low DebtToIncomeRatio of not more than 0.4 compared to other lower loan amounts. Most Loan amounts, regardless of the amount, have a DebtToIncomeRatio of between 0 and 1.

How does the LoanOriginalAmount feature relate to AmountDelinquent feature?¶

In [54]:
#plot a scatter plot of LoanOriginalAmount against AmountDelinquent
#plt.figure(figsize=(5,5))
plt.scatter(data=df_copy, x='AmountDelinquent', y='LoanOriginalAmount', alpha=1/10)
plt.xlabel('AmountDelinquent')
plt.ylabel('LoanOriginalAmount')
plt.title('Scatter Plot of AmountDelinquent against LoanOriginalAmount');
  • Given that most loan amounts above 25,000 are current, only 2 were delinquent with 18,239 and 70,734. 3 loan amounts of 4000 stood out with AmountDelinquent of above 300,000 and 400,000
  • Most loan amounts have AmountDelinquent of less than 30,000

How does the LoanOriginalAmount feature relate to AvailableBankcardCredit feature?¶

In [53]:
#plot a scatter plot of LoanOriginalAmount against AvailableBankcardCredit
#plt.figure(figsize=(5,5))
plt.scatter(data=df_copy, x='AvailableBankcardCredit', y='LoanOriginalAmount', alpha=1/10)
plt.xlabel('AvailableBankcardCredit')
plt.ylabel('LoanOriginalAmount')
plt.title('Scatter Plot of AvailableBankcardCredit against LoanOriginalAmount');
  • Most LoanOriginalAmounts have AvailableBankcardCredit of less than 100,000. Compared to low amounts, LoanOriginalAmounts of above 25000 do not have AvailableBankcardCredit greater than 200,000.

How does the LoanOriginalAmount feature relate to EmploymentStatus feature?¶

In [30]:
#plot a box plot depicting relationship between EmploymentStatus and LoanOriginalAmount
base_color = sns.color_palette()[0]
sns.boxplot(data=df_copy, x='EmploymentStatus', y='LoanOriginalAmount', color=base_color)
plt.xticks(rotation=15);
plt.xlabel('EmploymentStatus')
plt.ylabel('LoanOriginalAmount');
plt.title('Box plot depicting relationship between EmploymentStatus and LoanOriginalAmount');
  • Generally borrowers in the Employed category have higher loan amounts.
  • Loan amounts above 25000 were issued to the Employed, Full-time and other categories.

How does the LoanOriginalAmount feature relate to IncomeRange feature?¶

In [58]:
#plot a Violin plot depicting relationship between IncomeRange and LoanOriginalAmount
base_color = sns.color_palette()[0]
sns.violinplot(data=df_copy, x='IncomeRange', y='LoanOriginalAmount', color=base_color, inner=None)
plt.xticks(rotation=15)
plt.xlabel('IncomeRange')
plt.ylabel('LoanOriginalAmount');
plt.title('Violin plot depicting relationship between IncomeRange and LoanOriginalAmount');
  • Most loan amounts above 25,000 were issued to borrowers in the largest income-range category of 100,000+
  • The $0, Not employed, 1-24,999, 25,000-49,999 and Not displayed categories have relatively higher distribution of lower loan amounts of less than 7000.

How does the LoanOriginalAmount feature relate to ProsperRating (numeric) feature?¶

In [41]:
#plot a box plot depicting relationship between ProsperRating (numeric) and LoanOriginalAmount
base_color = sns.color_palette()[0]
sns.boxplot(data=df_copy, x='ProsperRating (numeric)', y='LoanOriginalAmount', color=base_color)
plt.xlabel('ProsperRating (numeric)')
plt.ylabel('LoanOriginalAmount');
plt.title('Box plot depicting relationship between ProsperRating (numeric) and LoanOriginalAmount');
  • Borrowers of loans above 25,000 are in the 7, 6 and 5 of ProsperRating (numeric) categories. 1, 2 and 3 categories had the least highest amounts.

How does the ProsperRating (numeric) feature relate to BorrowerRate feature?¶

In [58]:
#plot a box plot depicting relationship between ProsperRating (numeric) and BorrowerRate
base_color = sns.color_palette()[0]
sns.boxplot(data=df_copy, x='ProsperRating (numeric)', y='BorrowerRate', color=base_color)
plt.xlabel('ProsperRating (numeric)')
plt.ylabel('BorrowerRate')
plt.title('Box plot depicting relationship between ProsperRating (numeric) and BorrowerRate');
  • The Box plot depicting relationship between ProsperRating (numeric) and BorrowerRate showcases that the general trend is that the higher the ProsperRating (numeric) the lower the BorrowerRate.

How does the LoanOriginalAmount feature relate to Term feature?¶

In [44]:
#plot a Violin plot depicting relationship between Term and LoanOriginalAmount
sns.violinplot(data=df_copy, x='Term', y='LoanOriginalAmount', color=base_color)
plt.title('Violin plot depicting relationship between Term and LoanOriginalAmount');
  • Term 12 and 36 have higher distribution of lower loan amounts of 5000 and below compared to Term 60 which has higher distribution of higher loan amounts of 10,000 and above.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

  • In regards to the relation betweeen LoanOriginalAmount and BorrowerRate, few Loan Amounts of 3000 and below had BorrowerRate of greater than 0.4. Loan Amounts of above 25000 had BorrowerRate of below 0.2.
  • Considering the relation between LoanOriginalAmount and CreditScoreRangeLower, most loan amounts CreditScoreRangeLower lie between 360 and 880. Loan Amounts of above 25000 associate with scores of 720 and above. A few Loan amounts have CreditScoreRangeLower of 0.
  • Relating DebtToIncomeRatio to LoanOriginalAmount, we observe that loan amounts above 25000 have low DebtToIncomeRatio of not more than 0.4 compared to other lower loan amounts. Most Loan amounts, regardless of the amount, have a DebtToIncomeRatio of between 0 and 1.
  • Given that most loan amounts above 25,000 are current, only 2 were delinquent with 18,239 and 70,734. 3 loan amounts of 4000 stood out with AmountDelinquent of above 300,000 and 400,000. Most loan amounts have AmountDelinquent of less than 30,000.
  • Most LoanOriginalAmounts have AvailableBankcardCredit of less than 100,000. Compared to low amounts, LoanOriginalAmounts of above 25000 do not have AvailableBankcardCredit greater than 200,000.
  • Considering the relation between LoanOriginalAmount and EmploymentStatus, generally borrowers in the Employed category have higher loan amounts. Loan amounts above 25000 were issued to the Employed, Full-time and other categories.
  • Most loan amounts above 25,000 were issued to borrowers in the largest income-range category of 100,000+. The $0, Not employed, 1-24,999, 25,000-49,999 and Not displayed categories have relatively higher distribution of lower loan amounts of less than 7000.
  • Borrowers of loans above 25,000 are in the 7, 6 and 5 of ProsperRating (numeric) categories. 1, 2 and 3 categories had the least highest amounts.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

  • The Box plot depicting relationship between ProsperRating (numeric) and BorrowerRate showcases that the general trend is that the higher the ProsperRating (numeric) the lower the BorrowerRate.

Multivariate Exploration¶

How do the IncomeRange, BorrowerRate, and LoanOriginalAmount features interact with each other?¶

In [50]:
#plot scatter plots of BorrowerRate against LoanOriginalAmount and facet using the IncomeRange feature
g = sns.FacetGrid(data = df_copy, col = 'IncomeRange', height = 4, col_wrap=3)
g.map(plt.scatter, 'BorrowerRate', 'LoanOriginalAmount');
  • Ploting scatter plots of BorrowerRate against LoanOriginalAmount and faceting using the IncomeRange feature depicts the 100000+ and the Not displayed IncomeRange categories as being significantly different from the other IncomeRange categories.
  • The 100000+ IncomeRange category is the one that only contains loan amounts above 25000 in addition to other lower amounts.
  • The Not displayed IncomeRange category contains the loan amounts that have the highest BorrowerRate compared to loan amounts in other IncomeRange categories.
In [53]:
#plot a violinplot of Term against LoanOriginalAmount and facet using the LoanStatus feature
g = sns.FacetGrid(data = df_copy, col = 'ProsperRating (numeric)', height = 4, col_wrap=3, sharex=False)
g.map(sns.violinplot, 'Term', 'LoanOriginalAmount');
/home/egunza/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py:670: UserWarning: Using the violinplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)
In [46]:
#plot a violinplot of Term against LoanOriginalAmount and facet using the LoanStatus feature
g = sns.FacetGrid(data = df_copy, col = 'LoanStatus', height = 4, col_wrap=3, sharex=False)
g.map(sns.violinplot, 'Term', 'LoanOriginalAmount');
/home/egunza/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py:670: UserWarning: Using the violinplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)

-

How do the ProsperRating (numeric), BorrowerRate, and LoanOriginalAmount features interact with each other?¶

In [48]:
#plot a scatter plot of BorrowerRate against LoanOriginalAmount and encode the 
#ProsperRating (numeric) using color.
groups = df_copy.groupby('ProsperRating (numeric)')
for name, group in groups:
    plt.plot(group.BorrowerRate, group.LoanOriginalAmount, marker='o', linestyle='', markersize=6, label=name)
plt.legend()
Out[48]:
<matplotlib.legend.Legend at 0x7f157b9dd2b0>
  • Across various amounts of the LoanOriginalAmount feature the BorrowerRate increases with the decrease in ProsperRating (numeric).

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

  • Ploting scatter plots of BorrowerRate against LoanOriginalAmount and faceting using the IncomeRange feature depicts the 100000+ and the Not displayed IncomeRange categories as being significantly different from the other IncomeRange categories. The 100000+ IncomeRange category is the one that only contains loan amounts above 25000 in addition to other lower amounts. The Not displayed IncomeRange category contains the loan amounts that have the highest BorrowerRate compared to loan amounts in other IncomeRange categories.

Were there any interesting or surprising interactions between features?¶

  • Across various amounts of the LoanOriginalAmount feature the BorrowerRate increases with the decrease in ProsperRating (numeric).

Conclusions¶

  • The distribution of the LoanOriginalAmount is right skewed with most loan amount between 2000 and 15000.
  • In regards to the relation betweeen LoanOriginalAmount and BorrowerRate, few Loan Amounts of 3000 and below had BorrowerRate of greater than 0.4. Loan Amounts of above 25000 had BorrowerRate of below 0.2.
  • Considering the relation between LoanOriginalAmount and CreditScoreRangeLower, most loan amounts CreditScoreRangeLower lie between 360 and 880. Loan Amounts of above 25000 associate with scores of 720 and above. A few Loan amounts have CreditScoreRangeLower of 0.
  • Relating DebtToIncomeRatio to LoanOriginalAmount, we observe that loan amounts above 25000 have low DebtToIncomeRatio of not more than 0.4 compared to other lower loan amounts. Most Loan amounts, regardless of the amount, have a DebtToIncomeRatio of between 0 and 1.
  • Given that most loan amounts above 25,000 are current, only 2 were delinquent with 18,239 and 70,734. 3 loan amounts of 4000 stood out with AmountDelinquent of above 300,000 and 400,000. Most loan amounts have AmountDelinquent of less than 30,000.
  • Most LoanOriginalAmounts have AvailableBankcardCredit of less than 100,000. Compared to low amounts, LoanOriginalAmounts of above 25000 do not have AvailableBankcardCredit greater than 200,000.
  • Considering the relation between LoanOriginalAmount and EmploymentStatus, generally borrowers in the Employed category have higher loan amounts. Loan amounts above 25000 were issued to the Employed, Full-time and other categories.
  • Most loan amounts above 25,000 were issued to borrowers in the largest income-range category of 100,000+. The $0, Not employed, 1-24,999, 25,000-49,999 and Not displayed categories have relatively higher distribution of lower loan amounts of less than 7000.
  • Borrowers of loans above 25,000 are in the 7, 6 and 5 of ProsperRating (numeric) categories. 1, 2 and 3 categories had the least highest amounts.
  • The Box plot depicting relationship between ProsperRating (numeric) and BorrowerRate showcases that the general trend is that the higher the ProsperRating (numeric) the lower the BorrowerRate.
  • Ploting scatter plots of BorrowerRate against LoanOriginalAmount and faceting using the IncomeRange feature depicts the 100000+ and the Not displayed IncomeRange categories as being significantly different from the other IncomeRange categories. The 100000+ IncomeRange category is the one that only contains loan amounts above 25000 in addition to other lower amounts. The Not displayed IncomeRange category contains the loan amounts that have the highest BorrowerRate compared to loan amounts in other IncomeRange categories.
  • Across various amounts of the LoanOriginalAmount feature the BorrowerRate increases with the decrease in ProsperRating (numeric).