wrangle_in_py.column_name_standardizer

Functions

string_standardizer(messy_string)

Converts the inputted messy_string to lowercase and replaces non-alphanumeric characters with underscores.

resulting_duplicates(original_strings, ...)

Identifies which strings became duplicates after standardization.

column_name_standardizer(df)

Returns a copy of the inputted dataframe with standardized column names.

Module Contents

wrangle_in_py.column_name_standardizer.string_standardizer(messy_string)[source]

Converts the inputted messy_string to lowercase and replaces non-alphanumeric characters (including spaces and punctuation) with underscores.

Parameters:

messy_string (str) – The input string to be standardized.

Raises:

TypeError – If the input messy_string is not a string.

Returns:

A standardized version of the input string, in lowercase and with non-alphanumeric characters replaced by underscores.

Return type:

str

Examples

>>> string_standardizer('Jack Fruit 88')
'jack_fruit_88'
>>> string_standardizer('PINEAPPLES')
'pineapples'
>>> string_standardizer('Dragon (Fruit)')
'dragon__fruit_'
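The replacement rule described above can be sketched with a single regular expression. This is a minimal illustration, not the package's source: the helper name `sketch_string_standardizer` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits, which this page does not specify.

```python
import re


def sketch_string_standardizer(messy_string):
    """Hypothetical re-implementation of the documented behaviour."""
    if not isinstance(messy_string, str):
        raise TypeError("messy_string must be a string")
    # Assumption: only ASCII letters and digits count as alphanumeric.
    # Each other character becomes one underscore (runs are not collapsed),
    # and the result is lowercased.
    return re.sub(r"[^0-9a-zA-Z]", "_", messy_string).lower()
```

Because each character is replaced individually, adjacent punctuation yields consecutive underscores, which matches the 'dragon__fruit_' example above.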
wrangle_in_py.column_name_standardizer.resulting_duplicates(original_strings, standardized_strings)[source]

Identifies which strings became duplicates after standardization.

Parameters:
  • original_strings (list of str) – List of strings before standardization.

  • standardized_strings (list of str) – List of strings after standardization.

Raises:
  • ValueError – If the inputs original_strings and standardized_strings are not the same length.

  • TypeError – If either input, original_strings or standardized_strings, is not a list of strings.

Returns:

A dictionary where the keys are the standardized strings with duplicate(s), and the values are lists of the original strings that map to them.

Return type:

dict

Examples

>>> strings_before = ['Jack Fruit 88.', "Jack!Fruit!88!", "PINEAPPLES"]
>>> strings_after = ["jack_fruit_88_", "jack_fruit_88_", "pineapples"]
>>> resulting_duplicates(strings_before, strings_after)
{'jack_fruit_88_': ['Jack Fruit 88.', 'Jack!Fruit!88!']}
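The grouping described above can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the name `sketch_resulting_duplicates` is hypothetical, and the order of the type and length checks is a guess.

```python
from collections import defaultdict


def sketch_resulting_duplicates(original_strings, standardized_strings):
    """Hypothetical re-implementation of the documented behaviour."""
    # Assumption: types are validated before lengths are compared.
    for strings in (original_strings, standardized_strings):
        if not isinstance(strings, list) or not all(isinstance(s, str) for s in strings):
            raise TypeError("both inputs must be lists of strings")
    if len(original_strings) != len(standardized_strings):
        raise ValueError("both inputs must have the same length")

    # Group each original string under its standardized form.
    groups = defaultdict(list)
    for original, standardized in zip(original_strings, standardized_strings):
        groups[standardized].append(original)

    # Keep only the standardized strings that more than one original maps to.
    return {key: originals for key, originals in groups.items() if len(originals) > 1}
```

Standardized strings with a single source, such as 'pineapples' in the example above, are left out of the result.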
wrangle_in_py.column_name_standardizer.column_name_standardizer(df)[source]

Returns a copy of the inputted dataframe with standardized column names. Column names will be converted to lowercase and non-alphanumerics (including spaces and punctuation) will be replaced with underscores.

If standardization produces duplicate column names, a UserWarning is issued.

Parameters:

df (pandas DataFrame) – The input pandas DataFrame whose column names need standardization.

Warning

UserWarning – If any of the standardized column names are the same.

Raises:

TypeError – If the input df is not a pandas DataFrame.

Returns:

A new DataFrame with standardized column names.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> data = {'Jack Fruit 88': [1, 2], 'PINEAPPLES': [3, 4], 'Dragon (Fruit)': [25, 30]}
>>> df = pd.DataFrame(data)
>>> column_name_standardizer(df)
   jack_fruit_88  pineapples  dragon__fruit_
0              1           3              25
1              2           4              30
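
The duplicate-name warning can be caught with the standard `warnings` module. The sketch below avoids pandas and works on a plain list of names: `sketch_standardize_names` is a hypothetical stand-in for the real function, and the warning message and the ASCII-only replacement rule are assumptions.

```python
import re
import warnings


def sketch_standardize_names(names):
    """Hypothetical stand-in that works on a list of column names."""
    # Assumption: only ASCII letters and digits are kept as-is.
    standardized = [re.sub(r"[^0-9a-zA-Z]", "_", name).lower() for name in names]
    # Warn (rather than raise) when two names collapse to the same string.
    if len(set(standardized)) < len(standardized):
        warnings.warn("standardization produced duplicate column names", UserWarning)
    return standardized


# Record warnings instead of printing them, so the caller can react.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_names = sketch_standardize_names(["Jack Fruit 88.", "Jack!Fruit!88!", "PINEAPPLES"])

has_duplicates = any(issubclass(w.category, UserWarning) for w in caught)
```

The same `catch_warnings` pattern should apply when calling column_name_standardizer(df) on a DataFrame, since the documented warning category is UserWarning.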