Python Data Analysis: Cleaning Data in 7 Steps (Part 2)

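Create another column, IsAlone, which is 1 when the passenger has no family aboard (FamilySize of 1) and 0 otherwise:
data1['IsAlone'] = np.where(data1['FamilySize'] > 1, 0, 1)
Create one more column, Title, by extracting the text between the comma and the first period in Name:
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1
Result: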
     Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Fare     Embarked  FamilySize  IsAlone  Title
0    0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      7.2500   S         2           0        Mr
1    1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      71.2833  C         2           0        Mrs
2    1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      7.9250   S         1           1        Miss
3    1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      53.1000  S         2           0        Mrs
4    0         3       Allen, Mr. William Henry                           male    35.0  0      0      8.0500   S         1           1        Mr
...
886  0         2       Montvila, Rev. Juozas                              male    27.0  0      0      13.0000  S         1           1        Rev
887  1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      30.0000  S         1           1        Miss
888  0         3       Johnston, Miss. Catherine Helen "Carrie"           female  28.0  1      2      23.4500  S         4           0        Miss
889  1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      30.0000  C         1           1        Mr
890  0         3       Dooley, Mr. Patrick                                male    32.0  0      0      7.7500   Q         1           1        Mr
891 rows × 12 columns

5.3 Binning
# qcut: 4 bins holding roughly equal numbers of rows (quartiles)
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
# cut: 6 bins of equal width over the age range
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1
Result:
     Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Fare     Embarked  FamilySize  IsAlone  Title  FareCut          AgeCut
0    0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      7.2500   S         2           0        Mr     (-0.001, 7.91]   (13.333, 26.667]
1    1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      71.2833  C         2           0        Mrs    (31.0, 512.329]  (26.667, 40.0]
2    1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      7.9250   S         1           1        Miss   (7.91, 14.454]   (13.333, 26.667]
3    1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      53.1000  S         2           0        Mrs    (31.0, 512.329]  (26.667, 40.0]
4    0         3       Allen, Mr. William Henry                           male    35.0  0      0      8.0500   S         1           1        Mr     (7.91, 14.454]   (26.667, 40.0]
...
886  0         2       Montvila, Rev. Juozas                              male    27.0  0      0      13.0000  S         1           1        Rev    (7.91, 14.454]   (26.667, 40.0]
887  1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      30.0000  S         1           1        Miss   (14.454, 31.0]   (13.333, 26.667]
888  0         3       Johnston, Miss. Catherine Helen "Carrie"           female  28.0  1      2      23.4500  S         4           0        Miss   (14.454, 31.0]   (26.667, 40.0]
889  1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      30.0000  C         1           1        Mr     (14.454, 31.0]   (13.333, 26.667]
890  0         3       Dooley, Mr. Patrick                                male    32.0  0      0      7.7500   Q         1           1        Mr     (-0.001, 7.91]   (26.667, 40.0]
891 rows × 14 columns

6 Encoding
6.1 The LabelEncoder method
Use sklearn's LabelEncoder:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# Each call refits the encoder and maps the column's categories to integers
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1
Result: the selected columns of data1 are now numeric, so a model can finally recognize them. That took some doing!
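One caveat worth illustrating: the code above reuses a single `label` object, so after the last `fit_transform` call it only remembers the categories of the most recently fitted column. The following is a minimal sketch on a toy DataFrame, not the author's code; the `toy` frame and the `encoders` dict are hypothetical names introduced here. It keeps one encoder per column so that codes can later be mapped back to their labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for data1 (hypothetical values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# One encoder per column, so each keeps its own classes_
encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    toy[col + "_Code"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder sorts the classes, so codes follow sorted label order
sex_classes = list(encoders["Sex"].classes_)
embarked_classes = list(encoders["Embarked"].classes_)

# Codes can be decoded again with the matching encoder
decoded_sex = list(encoders["Sex"].inverse_transform(toy["Sex_Code"]))

print(toy)
print(sex_classes, embarked_classes)
print(decoded_sex)
```

Keeping the fitted encoders around also makes it possible to apply the same mapping to a test set with `transform`, rather than refitting on different data.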
6.2 The get_dummies method
get_dummies turns a long DataFrame into a wide one: each category of a column becomes its own 0/1 indicator column.
pd.get_dummies(data1['Sex'])
Result:
     female  male
0    0       1
1    1       0
2    1       0
3    1       0
4    0       1
...
886  0       1
887  1       0
888  1       0
889  0       1
890  0       1
891 rows × 2 columns

LabelEncoder, by contrast, keeps a single column and merely encodes female as 0 and male as 1:
# Sex_Code holds the result of label.fit_transform(data1['Sex'])
data1['Sex_Code']
0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex_Code, Length: 891, dtype: int64

7 Check again
# Step 7: data cleaning check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code         891 non-null int64
Pclass           891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
SibSp            891 non-null int64
Parch            891 non-null int64
Age              891 non-null float64
Fare             891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 29 columns):
Pclass                891 non-null int64
SibSp                 891 non-null int64
Parch                 891 non-null int64
Age                   891 non-null float64
Fare                  891 non-null float64
FamilySize            891 non-null int64
IsAlone               891 non-null int64
Sex_female            891 non-null uint8
Sex_male              891 non-null uint8
Embarked_C            891 non-null uint8
Embarked_Q            891 non-null uint8
Embarked_S            891 non-null uint8
Title_Capt            891 non-null uint8
Title_Col             891 non-null uint8
Title_Don             891 non-null uint8
Title_Dr              891 non-null uint8
Title_Jonkheer        891 non-null uint8
Title_Lady            891 non-null uint8
Title_Major           891 non-null uint8
Title_Master          891 non-null uint8
Title_Miss            891 non-null uint8
Title_Mlle            891 non-null uint8
Title_Mme             891 non-null uint8
Title_Mr              891 non-null uint8
Title_Mrs             891 non-null uint8
Title_Ms              891 non-null uint8
Title_Rev             891 non-null uint8
Title_Sir             891 non-null uint8
Title_the Countess    891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB
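The names `data1_x_alg` and `data1_dummy` used in this check are not defined in this part of the article. Judging only from the column lists in the `info()` output above, they may have been built roughly as follows. This is a sketch under that assumption, run on a toy frame; the toy values and the `data1_x` list are hypothetical, and none of it is the author's confirmed code:

```python
import pandas as pd

# Toy frame standing in for the cleaned data1 (hypothetical values)
data1 = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
    "Embarked": ["S", "C", "S"],
    "Title": ["Mr", "Mrs", "Miss"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "FamilySize": [2, 2, 1],
    "IsAlone": [0, 0, 1],
    "Sex_Code": [1, 0, 0],
    "Embarked_Code": [1, 0, 1],
    "Title_Code": [1, 2, 0],
})

# Assumed: label-encoded feature list, matching the first info() block
data1_x_alg = ["Sex_Code", "Pclass", "Embarked_Code", "Title_Code",
               "SibSp", "Parch", "Age", "Fare"]

# Assumed: raw feature list; get_dummies expands only the text columns
# and leaves the numeric columns as they are
data1_x = ["Sex", "Pclass", "Embarked", "Title", "SibSp", "Parch",
           "Age", "Fare", "FamilySize", "IsAlone"]
data1_dummy = pd.get_dummies(data1[data1_x])

# The point of the check: no selected column should still have missing values
missing_alg = int(data1[data1_x_alg].isnull().sum().sum())
missing_dummy = int(data1_dummy.isnull().sum().sum())

print(list(data1_dummy.columns))
print(missing_alg, missing_dummy)
```

Counting nulls directly with `isnull().sum()` gives the same assurance as reading the "non-null" counts in `info()`, but in a form that is easy to assert on in a script.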

