wrangle_in_py.column_drop_threshold

Functions

column_drop_threshold(df, threshold[, variance])

Returns a copy of the input dataframe with columns removed if they exceed the specified missingness threshold or fall below the specified coefficient of variation.

Module Contents

wrangle_in_py.column_drop_threshold.column_drop_threshold(df, threshold, variance=None)

Returns a copy of the input dataframe, dropping columns whose proportion of missing values exceeds the specified threshold and, if variance is given, columns whose coefficient of variation is lower than specified.

Parameters:
  • df (pd.DataFrame) – The input pandas dataframe whose columns are checked for missingness and coefficient of variation.

  • threshold (float) – Must satisfy 0 <= threshold <= 1. The largest proportion of missing values to allow in each column of the dataframe. Columns with a larger proportion of missing observations than the threshold will be removed from the dataframe.

  • variance (float) – Default is None. The lowest coefficient of variation to allow in any one column of the dataframe. Columns with a lower coefficient of variation than specified will be removed from the dataframe. A column must contain at least 2 numbers, because the coefficient of variation cannot be calculated with 1 or 0 numbers.

Raises:
  • TypeError : – If the input for df is not a pandas DataFrame.

  • ValueError : – If the input for threshold is not a float in the inclusive range 0 to 1, or if the input for variance is not a float >= 0.

Returns:

A new dataframe in which every column has a proportion of missing values no larger than the specified threshold and, if variance was given, a coefficient of variation no lower than specified. Columns that did not satisfy these conditions have been removed.

Return type:

pd.DataFrame
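
The checks listed under Raises could be sketched roughly as follows. This is only an illustration of the documented behaviour, not the package's actual source: the helper name validate_inputs and the error messages are hypothetical.

```python
import pandas as pd


def validate_inputs(df, threshold, variance=None):
    """Hypothetical helper mirroring the documented Raises section."""
    # df must be a pandas DataFrame, otherwise TypeError.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    # threshold must be a float in the inclusive range 0 to 1.
    if not isinstance(threshold, float) or not 0 <= threshold <= 1:
        raise ValueError("threshold must be a float between 0 and 1 inclusive")
    # variance is optional; when given it must be a float >= 0.
    if variance is not None:
        if not isinstance(variance, float) or variance < 0:
            raise ValueError("variance must be a float >= 0")
```

Note that, following the documentation above, a non-float threshold raises ValueError rather than TypeError in this sketch.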

Examples

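>>> import numpy as np
>>> import pandas as pd
>>> from wrangle_in_py.column_drop_threshold import column_drop_threshold
>>> data = {'apple': [1, 2, np.nan], 'banana': [3, 4, 5], 'kiwi': [np.nan, 30, np.nan], 'peach': [2, 2, 2]}
>>> df = pd.DataFrame(data)
>>> column_drop_threshold(df, 0.35, 0.1)
   apple  banana
0    1.0       3
1    2.0       4
2    NaN       5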
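
To see why only apple and banana survive in the example above, the two per-column quantities can be computed directly with pandas. This is a minimal sketch and not the package's implementation: it assumes the coefficient of variation is the sample standard deviation (pandas default, ddof=1) divided by the mean, which the documentation does not state explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "apple": [1, 2, np.nan],
    "banana": [3, 4, 5],
    "kiwi": [np.nan, 30, np.nan],
    "peach": [2, 2, 2],
})

# Proportion of missing values in each column.
missing = df.isna().mean()

# Coefficient of variation per column, assumed here to be
# sample standard deviation divided by the mean (NaNs are skipped).
cv = df.std() / df.mean()

# Columns that pass both documented conditions for
# threshold=0.35 and variance=0.1.
kept = [c for c in df.columns if missing[c] <= 0.35 and cv[c] >= 0.1]
print(kept)
```

Under this assumption, kiwi fails the missingness check (two of three values missing), and peach fails the variation check because a constant column has a coefficient of variation of 0.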