Preprocess data¶
analysis ¶
init_check ¶
init_check(
df,
identifier=None,
cat_cols=None,
cont_cols=None,
verbose=False,
)
Procedure to check:
* duplicated rows in the dataset
* general stats of numerical features
* general stats of categorical features
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
identifier
(
str
) –column which identifies unique user IDs
-
cat_cols
(
list
) –categorical features in the dataset
-
cont_cols
(
list
) –numerical features in the dataset
Returns:
-
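duplicated_ids(
int
) –number of duplicated rows found in the dataset
-
cont_cols_desc(
pd.DataFrame
) –general stats of the numerical features
-
cat_cols_desc(
pd.DataFrame
) –general stats of the categorical features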
Source code in churn_pred/eda/features/analysis.py
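A minimal sketch of the three checks listed above, written in plain pandas rather than calling `init_check` itself. The helper name `init_check_sketch` and the example columns are hypothetical, and the real function may compute or format its outputs differently.

```python
import pandas as pd


def init_check_sketch(df, identifier=None, cat_cols=None, cont_cols=None):
    """Hypothetical stand-in for init_check; not the library's implementation."""
    # Count entries that repeat an earlier identifier (or an earlier full row).
    if identifier is not None:
        duplicated_ids = int(df[identifier].duplicated().sum())
    else:
        duplicated_ids = int(df.duplicated().sum())
    # Summary statistics for numerical and categorical columns.
    cont_cols_desc = df[cont_cols].describe() if cont_cols else None
    cat_cols_desc = df[cat_cols].astype("object").describe() if cat_cols else None
    return duplicated_ids, cont_cols_desc, cat_cols_desc


df = pd.DataFrame(
    {
        "user_id": [1, 2, 2, 3],
        "plan": ["a", "b", "b", "a"],
        "spend": [10.0, 20.0, 20.0, 5.0],
    }
)
dup, cont_desc, cat_desc = init_check_sketch(
    df, identifier="user_id", cat_cols=["plan"], cont_cols=["spend"]
)
```

Here `dup` counts the repeated `user_id`, `cont_desc` holds count/mean/std/quantiles for `spend`, and `cat_desc` holds count/unique/top/freq for `plan`.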
missing ¶
missing(df, scale='linear', plot=False)
Procedure to check fraction of missing values in the dataset.
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
missing_val_frac(
DataFrame
) –sorted dataframe with fraction of missing values per feature
-
fig(
Figure
) –plot with sorted fractions of missing values in each column
Source code in churn_pred/eda/features/analysis.py
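As an illustration of what a "fraction of missing values per feature" table looks like, here is a small pandas sketch. It does not call `missing`; the column name `missing_frac` and the descending sort order are assumptions, and the plot is omitted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30, np.nan, 41, np.nan],
        "city": ["Oslo", "Rome", None, "Lima"],
        "spend": [1.0, 2.0, 3.0, 4.0],
    }
)

# isna() marks missing cells; the column-wise mean of booleans is the fraction.
missing_val_frac = (
    df.isna().mean().sort_values(ascending=False).to_frame("missing_frac")
)
```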
zero ¶
zero(df, scale='linear', plot=False)
Procedure to check fraction of zero values in the dataset.
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
zero_val_frac(
DataFrame
) –sorted dataframe with fraction of '0' values per feature
-
fig(
Figure
) –plot with sorted fractions of '0' values in each column
Source code in churn_pred/eda/features/analysis.py
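The same idea applies to zero values. The sketch below uses plain pandas on numeric columns only; it is not the implementation of `zero`, and the column name `zero_frac` and the sort order are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "calls": [0, 3, 0, 0],
        "spend": [0.0, 2.5, 1.0, 4.0],
    }
)

# Compare every cell with 0, then average the booleans per column.
zero_val_frac = (
    (df == 0).mean().sort_values(ascending=False).to_frame("zero_frac")
)
```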
nunique ¶
nunique(df, scale='linear', plot=False)
Procedure to plot features sorted by their number of unique values.
Parameters:
-
df
(
DataFrame
) –dataset
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
fig(
Figure
) –plot with sorted number of unique values in features
Source code in churn_pred/eda/features/analysis.py
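A short sketch of the quantity being plotted, assuming it corresponds to pandas' `DataFrame.nunique`. The example data are made up and the plot is left out.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "plan": ["a", "b", "a", "a"],
        "active": [True, True, True, True],
    }
)

# Number of distinct values per column, largest first.
nunique_per_feature = df.nunique().sort_values(ascending=False)
```

In a table like this, a column with a single unique value (`active`) carries no information, while a column with as many unique values as rows (`user_id`) usually behaves like an identifier.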
std ¶
std(df, scale='linear', plot=False)
Procedure to plot continuous features sorted by their standard deviation.
Parameters:
-
df
(
DataFrame
) –dataset
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
fig(
Figure
) –plot with sorted standard deviation of continuous features
Source code in churn_pred/eda/features/analysis.py
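A sketch of the underlying numbers, assuming the function relies on something equivalent to pandas' `DataFrame.std`. Note that pandas computes the sample standard deviation (`ddof=1`) by default; whether `std` does the same is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, 1.0, 1.0, 1.0],  # constant feature
        "b": [0.0, 2.0, 4.0, 6.0],
        "c": [1.0, 2.0, 1.0, 2.0],
    }
)

# Sample standard deviation per numeric column, largest first.
std_per_feature = df.std(numeric_only=True).sort_values(ascending=False)
```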
entropy ¶
entropy(df, scale='linear', plot=False)
Procedure to plot features sorted by their entropy.
Parameters:
-
df
(
DataFrame
) –dataset
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
fig(
Figure
) –plot with sorted entropy of the features
Source code in churn_pred/eda/features/analysis.py
plotting ¶
cross_correlation ¶
cross_correlation(df, n=10, verbose=False)
Procedure to calculate and plot cross-correlation of features in the dataset.
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
verbose
(
bool
) –whether to show the n most correlated features
-
n
(
int
) –number of most correlated features shown in the verbose output
Returns:
-
fig(
Figure
) –heatmap of continuous features cross correlations
Source code in churn_pred/eda/features/plotting.py
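One way to obtain the "n most correlated features" from a correlation matrix is sketched below. It assumes Pearson correlation via `DataFrame.corr` and ranking by absolute value; neither is confirmed as the behaviour of `cross_correlation`, and the heatmap itself is omitted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with x
        "z": [5.0, 3.0, 4.0, 1.0, 2.0],
    }
)

corr = df.corr(numeric_only=True)  # Pearson correlation matrix

# Keep the strict upper triangle so each pair appears once, without the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# The n pairs with the largest absolute correlation.
n = 2
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)
```

`top_pairs` is a Series indexed by `(feature_a, feature_b)` tuples, which is convenient for printing in a verbose mode.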
distributions ¶
distributions(
df, low_per_cut=0, high_per_cut=1, type="box"
)
Procedure to plot distributions of the features split by column split_col, using a workaround for violin plots in seaborn: https://stackoverflow.com/a/64787568/8147433
DISCLAIMER: for now, NA values in cont_cols are filled with 0.
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
low_per_cut
(
float
) –lower percentile where to cut the plot for better readability
-
high_per_cut
(
float
) –higher percentile where to cut the plot for better readability
-
type
(
str
) –type of distribution plot
Returns:
-
fig(
Figure
) –distribution plot for each feature
Source code in churn_pred/eda/features/plotting.py
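The effect of `low_per_cut` and `high_per_cut` can be pictured with the sketch below. It assumes both are quantile fractions in [0, 1] (suggested by the defaults 0 and 1) and that values outside the range are simply left out of the plot; the actual function may clip the axis instead. The NA fill with 0 follows the disclaimer above.

```python
import pandas as pd

spend = pd.Series(
    [float("nan"), 2, 3, 4, 5, 6, 7, 8, 9, 1000], name="spend"
)

# NA values are filled with 0, as stated in the disclaimer.
filled = spend.fillna(0)

# Assumed meaning of the cut parameters: quantile fractions.
low_per_cut, high_per_cut = 0.0, 0.9
lo, hi = filled.quantile(low_per_cut), filled.quantile(high_per_cut)

# Keep only values inside the chosen quantile range before plotting.
trimmed = filled[(filled >= lo) & (filled <= hi)]
```

With these numbers the single extreme value (1000) is dropped, so a box or violin plot of `trimmed` is no longer squashed by the outlier.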
analysis ¶
correlation ¶
correlation(df, target, scale='linear', plot=False)
Procedure to plot most correlated numerical features with target column.
Parameters:
-
df
(
pd.DataFrame
) –pandas dataframe
-
target
(
pd.Series
) –target values
-
scale
(
str
) –y scale of the plot
-
plot
(
bool
) –whether to output the plot
Returns:
-
sorted_corr_cols(
pd.DataFrame
) –sorted dataframe with feature and target value correlation
-
fig(
Figure
) –sorted bar plot with feature and target value correlation
Source code in churn_pred/eda/target/analysis.py
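A sketch of how a sorted feature/target correlation table could be built with pandas. It uses `DataFrame.corrwith` on numeric columns and sorts by absolute correlation; the column name `corr` and the sorting rule are assumptions, not the documented behaviour of `correlation`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tenure": [1, 2, 3, 4, 5, 6],
        "spend": [5.0, 3.0, 6.0, 2.0, 4.0, 1.0],
        "plan": ["a", "b", "a", "b", "a", "b"],  # non-numeric, ignored below
    }
)
target = pd.Series([0, 0, 0, 1, 1, 1], name="churn")

# Pearson correlation of every numeric feature with the target.
corr_with_target = df.select_dtypes(include="number").corrwith(target)

# Strongest relationships first, regardless of sign.
sorted_corr_cols = corr_with_target.sort_values(
    key=lambda s: s.abs(), ascending=False
).to_frame("corr")
```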
plotting ¶
prob_distrib_per_class ¶
prob_distrib_per_class(predicted_probs, actual, task)
Procedure to plot probability density distributions per class from LightGBM predictions.
Parameters:
-
predicted_probs
(
ndarray
) –predicted probabilities
-
actual
(
ndarray
) –ground truth classes
-
task
(
str
) –type of task
Returns:
-
fig(
Figure
) –probability density distribution plot for each class
Source code in churn_pred/eda/target/plotting.py
distributions_in_binary_cls ¶
distributions_in_binary_cls(
df, target, low_per_cut=0, high_per_cut=1
)
Procedure to plot distributions of the features split by column split_col, using a workaround for violin plots in seaborn: https://stackoverflow.com/a/64787568/8147433
DISCLAIMER: for now, NA values in cont_cols are filled with 0.
Parameters:
-
df
(
DataFrame
) –pandas dataframe
-
target
(
pd.Series
) –target values, i.e. binary classes
Returns:
-
fig(
Figure
) –distribution plot for each feature
Source code in churn_pred/eda/target/plotting.py
plotting ¶
bar_plot ¶
bar_plot(df, ax, title)
Helper method for unified bar plot
Parameters:
-
df
(
DataFrame
) –dataframe to plot
-
ax
(
Axes
) –axes defining where to plot it
Returns:
-
adjusted_plot(
Axes
) –adjusted axes
Source code in churn_pred/eda/plotting.py
general_utils ¶
intsec ¶
intsec(list1, list2)
Simple intersection of two lists.
Parameters:
-
list1
(
list
) –first list
-
list2
(
list
) –second list
Returns:
-
list(
list
) –intersection of lists
Source code in churn_pred/eda/general_utils.py
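For illustration, an order-preserving list intersection can be written as below. The name `intsec_sketch` is hypothetical, and the real `intsec` may differ, for example by going through `set` and not preserving order.

```python
def intsec_sketch(list1, list2):
    """Hypothetical order-preserving intersection; not the library's code."""
    lookup = set(list2)  # constant-time membership checks
    return [item for item in list1 if item in lookup]


common = intsec_sketch(["age", "plan", "spend"], ["spend", "age", "city"])
```

A helper like this is handy for keeping only those expected feature names that are actually present in `df.columns`.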
entropy_calc ¶
entropy_calc(labels, base=np.e)
Computes entropy of both continuous and categorical features. Shamelessly stolen from: https://stackoverflow.com/a/45091961
Parameters:
-
labels
(
list, ndarray, Series
) –list of values
Returns:
-
ent(
float
) –entropy of the list of values
Source code in churn_pred/eda/general_utils.py
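The quantity in question is the Shannon entropy of the empirical value distribution, H = -Σ p_i log_b(p_i), where p_i is the relative frequency of the i-th distinct value. The sketch below is a standard-library version under that assumption; `entropy_sketch` is a hypothetical name and is not the code of `entropy_calc`. With this counting approach, continuous values are treated as discrete labels.

```python
import math
from collections import Counter


def entropy_sketch(labels, base=math.e):
    """Shannon entropy of the empirical distribution of `labels`."""
    n = len(labels)
    if n <= 1:
        return 0.0
    counts = Counter(labels)
    # Sum p * log_base(p) over the distinct values, then flip the sign.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


uniform = entropy_sketch(["a", "b", "c", "d"], base=2)  # four equally likely values
constant = entropy_sketch(["a", "a", "a"])  # a single repeated value
```

Four equally likely values give 2 bits in base 2, while a constant feature has zero entropy, which is why low-entropy features are often candidates for removal.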