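The previous article analyzed the task of the Tianchi "Financial Risk Control: Loan Default Prediction" competition.
[Machine Learning] Tianchi Financial Risk Control: Loan Default Prediction, Task Analysis
This article is part two, data analysis: getting to know the data and becoming familiar with it, in preparation for the feature engineering that follows.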
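Purpose:
The main value of EDA (exploratory data analysis) lies in becoming familiar with the basic condition of the whole dataset (missing values, outliers) and in verifying that the dataset is fit for the machine learning or deep learning modeling that follows. It also reveals the relationships among the variables and between the variables and the prediction target, which prepares the ground for feature engineering.
Competition page: https://tianchi./competition/entrance//introduction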
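Take a look at the column names. The task-analysis part already gave the meaning of each feature; the list is repeated here for easier reading:

| Field | Description |
| --- | --- |
| id | Unique letter-of-credit identifier assigned to the loan listing |
| loanAmnt | Loan amount |
| term | Loan term (year) |
| interestRate | Loan interest rate |
| installment | Installment payment amount |
| grade | Loan grade |
| subGrade | Loan sub-grade |
| employmentTitle | Employment title |
| employmentLength | Employment length (years) |
| homeOwnership | Home ownership status provided by the borrower at registration |
| annualIncome | Annual income |
| verificationStatus | Verification status |
| issueDate | Month in which the loan was issued |
| purpose | Loan purpose category given by the borrower in the loan application |
| postCode | First 3 digits of the postal code provided by the borrower in the loan application |
| regionCode | Region code |
| dti | Debt-to-income ratio |
| delinquency_2years | Number of delinquency events 30+ days past due in the borrower's credit file over the past 2 years |
| ficoRangeLow | Lower bound of the borrower's FICO range at loan issuance |
| ficoRangeHigh | Upper bound of the borrower's FICO range at loan issuance |
| openAcc | Number of open credit lines in the borrower's credit file |
| pubRec | Number of derogatory public records |
| pubRecBankruptcies | Number of public record bankruptcies |
| revolBal | Total revolving credit balance |
| revolUtil | Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit |
| totalAcc | Total number of credit lines currently in the borrower's credit file |
| initialListStatus | Initial listing status of the loan |
| applicationType | Whether the loan is an individual application or a joint application with two co-borrowers |
| earliesCreditLine | Month in which the borrower's earliest reported credit line was opened |
| title | Loan title provided by the borrower |
| policyCode | Publicly available policy_code=1; new products not publicly available policy_code=2 |
| n-series anonymous features | Anonymous features n0-n14, processed counts of certain borrower behaviors |

Use info() to get familiar with the data types: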
    data_train.info()

    RangeIndex: entries, 0 to
    Data columns (total 47 columns):
     #   Column              Non-Null Count  Dtype
    ---  ------              --------------  -----
     0   id                  non-null        int64
     1   loanAmnt            non-null        float64
     2   term                non-null        int64
     3   interestRate        non-null        float64
     4   installment         non-null        float64
     5   grade               non-null        object
     6   subGrade            non-null        object
     7   employmentTitle     non-null        float64
     8   employmentLength    non-null        object
     9   homeOwnership       non-null        int64
     10  annualIncome        non-null        float64
     11  verificationStatus  non-null        int64
     12  issueDate           non-null        object
     13  isDefault           non-null        int64
     14  purpose             non-null        int64
     15  postCode            non-null        float64
     16  regionCode          non-null        int64
     17  dti                 non-null        float64
     18  delinquency_2years  non-null        float64
     19  ficoRangeLow        non-null        float64
     20  ficoRangeHigh       non-null        float64
     21  openAcc             non-null        float64
     22  pubRec              non-null        float64
     23  pubRecBankruptcies  non-null        float64
     24  revolBal            non-null        float64
     25  revolUtil           non-null        float64
     26  totalAcc            non-null        float64
     27  initialListStatus   non-null        int64
     28  applicationType     non-null        int64
     29  earliesCreditLine   non-null        object
     30  title               non-null        float64
     31  policyCode          non-null        float64
     32  n0                  non-null        float64
     33  n1                  non-null        float64
     34  n2                  non-null        float64
     35  n2.1                non-null        float64
     36  n4                  non-null        float64
     37  n5                  non-null        float64
     38  n6                  non-null        float64
     39  n7                  non-null        float64
     40  n8                  non-null        float64
     41  n9                  non-null        float64
     42  n10                 non-null        float64
     43  n11                 non-null        float64
     44  n12                 non-null        float64
     45  n13                 non-null        float64
     46  n14                 non-null        float64
    dtypes: float64(33), int64(9), object(5)
    memory usage: 286.9+ MB

Take a rough overall look at some basic statistics of each feature in the dataset:
    data_train.describe()

The output is a table of 8 rows × 42 columns. The rows are count, mean, std, min, 25%, 50%, 75% and max; the displayed columns are id, loanAmnt, term, interestRate, installment, employmentTitle, homeOwnership, annualIncome, verificationStatus, isDefault, …, n5 to n14. Most cell values are no longer legible in this copy. What remains readable: loanAmnt ranges from 500 to 40000, term from 3 to 5, the maximum annualIncome is 1.099920e+07, and the min, quantile and max rows for n5 to n14 are as follows:

          n5     n6    n7     n8    n9   n10  n11  n12   n13   n14
    min    0      0     0      1     0     0    0    0     0     0
    25%    5      4     5      9     3     8    0    0     0     1
    50%    7      7     7     13     5    11    0    0     0     2
    75%   11     11    10     19     7    14    0    0     0     3
    max   70    132    79    128    45    82    4    4    39    30
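describe() reports 42 columns while info() lists 47; the gap matches the 5 object-typed columns, which describe() leaves out by default. The following is a minimal sketch of splitting columns by type and summarizing each group separately. The frame `toy` and all its values are made up for illustration and only stand in for `data_train`.

```python
import pandas as pd

# Toy frame standing in for data_train; the values are illustrative only.
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0],
    "term": [5, 5, 3],
    "grade": ["E", "D", "D"],
    "employmentLength": ["2 years", "5 years", "8 years"],
})

# Numeric columns: the ones describe() summarizes by default.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (string columns here) is skipped unless requested.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# count, mean, std, min, 25%, 50%, 75%, max for the numeric columns.
numeric_summary = toy.describe()
# count, unique, top, freq for the non-numeric columns.
non_numeric_summary = toy[non_numeric_cols].describe()

print(numeric_cols)
print(non_numeric_cols)
```

On the real data the same two calls would be applied to `data_train` instead of `toy`.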
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same preview.
    # Assumes pandas has been imported as pd.
    pd.concat([data_train.head(3), data_train.tail(3)])

The preview shows the first three and the last three rows, 6 rows × 47 columns in total. Most cell values are no longer legible in this copy; one detail that remains visible is that the second row holds NaN in n5 to n9 and in n11 to n14.
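One common way to count how many columns contain missing values is `isnull().any().sum()`. The sketch below is a minimal illustration on a toy frame; `toy` and its values are invented here and only stand in for `data_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_train; the values are illustrative only.
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "employmentLength": ["2 years", None, "8 years", "10+ years"],
    "n5": [9.0, np.nan, 0.0, 4.0],
    "n10": [7.0, 13.0, 11.0, 5.0],
})

# isnull().any() is True for each column with at least one missing value;
# summing the booleans counts those columns.
has_missing = toy.isnull().any()
n_cols_with_missing = int(has_missing.sum())
cols_with_missing = toy.columns[has_missing].tolist()

print(f"{n_cols_with_missing} columns contain missing values: {cols_with_missing}")
```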
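As found above, 22 feature columns in the training set contain missing values. Next, check which of these features have a missing rate above 50%: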
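    # Missing rate of every column, as a dict of {column: rate}
    have_null_fea_dict = (data_train.isnull().sum() / len(data_train)).to_dict()
    # Keep only the columns whose missing rate exceeds 50%
    fea_null_moreThanHalf = {}
    for key, value in have_null_fea_dict.items():
        if value > 0.5:
            fea_null_moreThanHalf[key] = value
    fea_null_moreThanHalf

    {}

The result is empty: no feature is missing in more than half of the rows.

Look at the features with missing values and their missing rates in detail:

    # Visualize the NaN rate per column
    missing = data_train.isnull().sum() / len(data_train)
    missing = missing[missing > 0].sort_values()
    missing.plot.bar()

Tip: LightGBM (lgb), a go-to model in competitions, can handle missing values automatically!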
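Find the features that take only a single value in the training and test sets:

    # Columns with at most one distinct value carry no information for the model
    one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]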
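The same single-value check can be applied to the training and test frames alike by wrapping it in a small helper. This is a minimal sketch under stated assumptions: the function name `single_value_features` and the frames `toy_train` and `toy_test` are hypothetical, and the toy values make no claim about the real dataset. Note that `nunique()` ignores NaN by default, so a column that is constant apart from missing values is also reported.

```python
import pandas as pd


def single_value_features(df):
    """Return the columns of df that hold at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique() <= 1]


# Toy frames standing in for the training and test sets; values are illustrative.
toy_train = pd.DataFrame({
    "policyCode": [1.0, 1.0, 1.0],
    "term": [3, 5, 3],
    "grade": ["A", "B", "A"],
})
toy_test = pd.DataFrame({
    "policyCode": [1.0, 1.0],
    "term": [3, 3],
    "grade": ["C", "D"],
})

one_value_train = single_value_features(toy_train)
one_value_test = single_value_features(toy_test)

print(one_value_train)
print(one_value_test)
```

A column that is constant in only one of the two frames, like `term` in the toy test frame, is worth a second look before dropping it.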