Preprocess data
entropy(df, scale='linear', plot=False)
Procedure to plot features sorted by their entropy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataset | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | plot with sorted entropy of the features |
Source code in inference_model/eda/features/analysis.py
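As a rough illustration of ranking features by entropy, the sketch below computes the Shannon entropy of each column from its value frequencies and sorts the result. The helper `column_entropy` and the toy columns are hypothetical, and the library's own computation may differ (for example in how continuous values are handled).

```python
import numpy as np
import pandas as pd


def column_entropy(series: pd.Series) -> float:
    """Shannon entropy (natural log) of the observed values in one column."""
    # value_counts(normalize=True) yields strictly positive probabilities
    probs = series.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log(probs)).sum())


# toy data, purely illustrative
df = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "binary": [0, 1, 0, 1],
    "unique": [1, 2, 3, 4],
})

# one entropy value per feature, lowest first
entropies = df.apply(column_entropy).sort_values()
```

A column with a single repeated value has zero entropy, which is why such features tend to end up at one end of the sorted plot.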
init_check(df, identifier=None, cat_cols=None, cont_cols=None, verbose=False)
Procedure to check:
* duplicated rows in the dataset
* general stats of numerical features
* general stats of categorical features
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
identifier | str | column which identifies unique user IDs | None |
cat_cols | list | categorical features in the dataset | None |
cont_cols | list | numerical features in the dataset | None |
Returns:
Name | Type | Description |
---|---|---|
duplicated_ids | int | |
cont_cols_desc | DataFrame | |
cat_cols_desc | DataFrame | |
Source code in inference_model/eda/features/analysis.py
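The three checks listed above can be approximated with plain pandas. This is a minimal sketch under assumptions: the column names and data are invented, and the exact statistics produced by `init_check` are not stated in this reference, so `describe()` is used here only as a stand-in for "general stats".

```python
import pandas as pd

# toy data, purely illustrative
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [30, 41, 41, 25],
    "city": ["Oslo", "Bergen", "Bergen", "Oslo"],
})
identifier = "user_id"   # hypothetical identifier column
cont_cols = ["age"]      # hypothetical numerical columns
cat_cols = ["city"]      # hypothetical categorical columns

# rows that repeat an earlier row in every column
duplicated_rows = int(df.duplicated().sum())
# identifier values that were already seen earlier in the column
duplicated_ids = int(df[identifier].duplicated().sum())

# summary statistics: count/mean/std/quantiles for numbers,
# count/unique/top/freq for categories
cont_cols_desc = df[cont_cols].describe()
cat_cols_desc = df[cat_cols].astype("object").describe()
```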
missing(df, scale='linear', plot=False)
Procedure to check fraction of missing values in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
missing_val_frac | DataFrame | sorted dataframe with fraction of missing values per feature |
fig | Figure | plot with sorted fractions of missing values in each column |
Source code in inference_model/eda/features/analysis.py
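The fraction of missing values per feature can be sketched in a single pandas expression. The column name `missing_frac` and the descending sort order are assumptions made for this example; the reference only states that the returned dataframe is sorted.

```python
import numpy as np
import pandas as pd

# toy data, purely illustrative
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isna() marks missing cells; the mean of a boolean column is its fraction of True
missing_val_frac = (
    df.isna()
    .mean()
    .sort_values(ascending=False)
    .to_frame("missing_frac")
)
```

The same pattern with `df.eq(0)` in place of `df.isna()` gives the fraction of zero values per feature.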
nunique(df, scale='linear', plot=False)
Procedure to plot features sorted by their number of unique values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataset | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | plot with sorted number of unique values in features |
Source code in inference_model/eda/features/analysis.py
std(df, scale='linear', plot=False)
Procedure to plot features sorted by their standard deviation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataset | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | plot with sorted standard deviation of continuous features |
Source code in inference_model/eda/features/analysis.py
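A minimal sketch of the ranking behind such a plot, assuming that only numeric columns are considered and that the sort is descending (neither detail is confirmed by this reference):

```python
import pandas as pd

# toy data, purely illustrative
df = pd.DataFrame({
    "narrow": [1.0, 1.1, 0.9, 1.0],
    "wide": [0.0, 10.0, -10.0, 5.0],
    "label": ["a", "b", "c", "d"],
})

# drop non-numeric columns, then compute the sample standard deviation per column
stds = df.select_dtypes("number").std().sort_values(ascending=False)
```

Because the standard deviation is the square root of the variance, sorting by either quantity gives the same feature order.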
zero(df, scale='linear', plot=False)
Procedure to check fraction of zero values in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
zero_val_frac | DataFrame | sorted dataframe with fraction of '0' values per feature |
fig | Figure | plot with sorted fractions of '0' values in each column |
Source code in inference_model/eda/features/analysis.py
cross_correlation(df, n=10, verbose=False)
Procedure to calculate and plot cross-correlation of features in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
verbose | bool | show n most correlated features | False |
n | int | number of correlated features in verbose output | 10 |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | heatmap of continuous features cross correlations |
Source code in inference_model/eda/features/plotting.py
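To illustrate what "the n most correlated features" could mean, the sketch below builds a Pearson correlation matrix and ranks each feature pair once by absolute correlation. The data, the use of Pearson correlation, and the ranking rule are assumptions for this example, not a description of the library's internals.

```python
import numpy as np
import pandas as pd

# synthetic data: "x_noisy" is almost a copy of "x", "independent" is unrelated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "independent": rng.normal(size=200),
})

corr = df.corr()  # Pearson correlation by default

# keep only the strict upper triangle so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# rank pairs by the magnitude of their correlation
n = 2
top_pairs = pairs.sort_values(key=abs, ascending=False).head(n)
```

The matrix `corr` is what a heatmap would display; `top_pairs` corresponds to a textual summary of the strongest relationships.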
distributions(df, low_per_cut=0, high_per_cut=1, type='box')
Procedure to plot distributions of the features split by column split_col, using a workaround for violin plots in seaborn:
* https://stackoverflow.com/a/64787568/8147433
DISCLAIMER: for now, NA values in cont_cols are filled with 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
low_per_cut | float | lower percentile where to cut the plot for better readability | 0 |
high_per_cut | float | higher percentile where to cut the plot for better readability | 1 |
type | str | type of distribution plot | 'box' |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | distribution plot per each feature |
Source code in inference_model/eda/features/plotting.py
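The effect of `low_per_cut` and `high_per_cut` can be pictured as trimming each feature to a quantile range before plotting. The sketch below assumes the cuts are quantile fractions between 0 and 1 (consistent with the defaults) and that values outside the range are dropped; whether the library drops values or only limits the axis is not stated here.

```python
import pandas as pd

# toy feature with one extreme outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

low_per_cut, high_per_cut = 0.0, 0.9

# translate the quantile fractions into value thresholds
low = values.quantile(low_per_cut)
high = values.quantile(high_per_cut)

# keep only the values inside the chosen quantile range
trimmed = values[(values >= low) & (values <= high)]
```

With the outlier removed, a box or violin plot of `trimmed` uses its vertical range for the bulk of the data instead of a single extreme point.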
correlation(df, target, scale='linear', plot=False)
Procedure to plot most correlated numerical features with target column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
target | Series | target values | required |
scale | str | y scale of the plot | 'linear' |
plot | bool | whether to output the plot | False |
Returns:
Name | Type | Description |
---|---|---|
sorted_corr_cols | DataFrame | sorted dataframe with feature and target value correlation |
fig | Figure | sorted bar plot with feature and target value correlation |
Source code in inference_model/eda/target/analysis.py
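A feature-to-target correlation table of this kind can be sketched with `DataFrame.corrwith`. Sorting by absolute value and the column name `correlation` are choices made for this example; the reference does not say how the library orders the result.

```python
import pandas as pd

# toy data, purely illustrative
df = pd.DataFrame({
    "up": [1, 2, 3, 4, 5],
    "down": [5, 4, 3, 2, 1.5],
    "flat": [1, 3, 2, 3, 1],
})
target = pd.Series([10, 20, 30, 40, 50], name="target")

# Pearson correlation of every column with the target, strongest first
sorted_corr_cols = (
    df.corrwith(target)
    .sort_values(key=abs, ascending=False)
    .to_frame("correlation")
)
```

Ranking by magnitude keeps strongly negative features such as `down` near the top, which a plain descending sort would push to the bottom.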
distributions_in_binary_cls(df, target, low_per_cut=0, high_per_cut=1)
Procedure to plot distributions of the features split by column split_col, using a workaround for violin plots in seaborn:
* https://stackoverflow.com/a/64787568/8147433
DISCLAIMER: for now, NA values in cont_cols are filled with 0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | pandas dataframe | required |
cont_cols | list | numerical features in the dataset | required |
target | Series | target values, i.e. binary classes | required |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | distribution plot per each feature |
Source code in inference_model/eda/target/plotting.py
prob_distrib_per_class(predicted_probs, actual, task)
Procedure to plot probability density distributions per class from LightGBM predictions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
predicted_probs | ndarray | predicted probs | required |
actual | ndarray | ground truth classes | required |
task | str | type of task | required |
Returns:
Name | Type | Description |
---|---|---|
fig | Figure | probability density distributions plot per each class |
Source code in inference_model/eda/target/plotting.py
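The data preparation behind a per-class probability density plot can be sketched with NumPy alone: group the predicted probabilities by their true class, then estimate a density for each group. The binary setup, the bin edges, and the use of a histogram as density estimate are assumptions of this example; the library may use a different estimator.

```python
import numpy as np

# toy binary-classification output, purely illustrative
predicted_probs = np.array([0.1, 0.8, 0.35, 0.9, 0.2, 0.65])
actual = np.array([0, 1, 0, 1, 0, 1])

# group predicted probabilities by their ground-truth class
probs_per_class = {
    int(cls): predicted_probs[actual == cls] for cls in np.unique(actual)
}

# histogram-based density estimate on a shared grid of 5 equal bins in [0, 1]
bins = np.linspace(0.0, 1.0, 6)
densities = {
    cls: np.histogram(probs, bins=bins, density=True)[0]
    for cls, probs in probs_per_class.items()
}
```

Well-separated densities for the two classes suggest that the model's scores discriminate between them.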
bar_plot(df, ax, title)
Helper method for unified bar plot
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | dataframe to plot | required |
ax | Axes | axes defining where to plot it | required |
Returns:
Name | Type | Description |
---|---|---|
adjusted_plot | Axes | adjusted axes |
Source code in inference_model/eda/plotting.py
entropy_calc(labels, base=np.e)
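Computes entropy of both continuous and categorical features. Shamelessly stolen from: https://stackoverflow.com/a/45091961
Parameters:
Name | Type | Description | Default |
---|---|---|---|
labels | list, ndarray, Series | list of values | required |
Returns:
Name | Type | Description |
---|---|---|
ent | float | entropy of the list of values |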
Source code in inference_model/eda/general_utils.py
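For readers who want to see the calculation itself, here is a standard-library sketch of Shannon entropy over a list of values. The name `entropy_sketch` is hypothetical and this is not the library's code; it only mirrors the documented inputs (a list of values and a logarithm base defaulting to e).

```python
import math
from collections import Counter


def entropy_sketch(labels, base=math.e):
    """Shannon entropy of the empirical distribution of `labels`."""
    n = len(labels)
    if n <= 1:
        return 0.0
    counts = Counter(labels)
    if len(counts) <= 1:
        # a single distinct value carries no information
        return 0.0
    # sum of -p * log(p) over the observed relative frequencies
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


fair_coin_bits = entropy_sketch([0, 1, 0, 1], base=2)
constant = entropy_sketch(["a"] * 5)
```

Continuous features are treated here exactly like categorical ones, by counting identical values, so a column of all-distinct floats reaches the maximum entropy of log(n).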
intsec(list1, list2)
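Simple intersection of two lists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
list1 | list | list1 | required |
list2 | list | list2 | required |
Returns:
Name | Type | Description |
---|---|---|
list | list | intersection of lists |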
Source code in inference_model/eda/general_utils.py
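One possible way to intersect two lists is shown below. The function name `intersection_sketch` is hypothetical, and preserving the order of the first list is a design choice of this example; the library's helper may return elements in a different order.

```python
def intersection_sketch(list1, list2):
    """Return the items of list1 that also occur in list2, in list1's order."""
    lookup = set(list2)  # set membership tests are O(1) on average
    return [item for item in list1 if item in lookup]


common_cols = intersection_sketch(["a", "b", "c"], ["c", "a", "d"])
```

A helper like this is handy for restricting a list of requested columns to those actually present in a dataframe.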