wrangle_in_py.column_name_standardizer

Functions

`string_standardizer`(messy_string)	Converts the inputted messy_string to lowercase and replaces non-alphanumeric characters with underscores.
`resulting_duplicates`(original_strings, ...)	Identifies which strings became duplicates after standardization.
`column_name_standardizer`(df)	Returns a copy of the inputted dataframe with standardized column names.

Module Contents

wrangle_in_py.column_name_standardizer.string_standardizer(messy_string)[source]

Converts the inputted messy_string to lowercase and replaces every non-alphanumeric character (including spaces and punctuation) with an underscore.

Parameters: messy_string (str) – The input string to be standardized.
Raises: TypeError – If the input messy_string is not a string.
Returns: A standardized version of the input string, in lowercase with non-alphanumeric characters replaced by underscores.
Return type: str

Examples

>>> string_standardizer('Jack Fruit 88')
'jack_fruit_88'

>>> string_standardizer('PINEAPPLES')
'pineapples'

>>> string_standardizer('Dragon (Fruit)')
'dragon__fruit_'
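
The behaviour described above can be approximated in a few lines. The following is a minimal sketch and not the package's actual source: the name `string_standardizer_sketch` is hypothetical, and it assumes "alphanumeric" means the ASCII characters a-z and 0-9, so accented or other non-ASCII letters would also become underscores.

```python
import re


def string_standardizer_sketch(messy_string):
    """Hypothetical re-implementation, for illustration only."""
    # Mirror the documented TypeError for non-string input.
    if not isinstance(messy_string, str):
        raise TypeError("messy_string must be a string")
    # Lowercase first, then replace anything outside a-z / 0-9 with "_".
    # Assumption: ASCII-only notion of "alphanumeric".
    return re.sub(r"[^a-z0-9]", "_", messy_string.lower())
```

Each non-alphanumeric character is replaced individually, which is why consecutive characters such as the space and opening parenthesis in 'Dragon (Fruit)' yield a double underscore.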

wrangle_in_py.column_name_standardizer.resulting_duplicates(original_strings, standardized_strings)[source]

Identifies which strings became duplicates after standardization.

Parameters:

original_strings (list of str) – List of strings before standardization.
standardized_strings (list of str) – List of strings after standardization.

Raises:

ValueError – If the inputs original_strings and standardized_strings are not the same length.
TypeError – If either input, original_strings or standardized_strings, is not a list of strings.

Returns:

A dictionary where the keys are the standardized strings with duplicate(s), and the values are lists of the original strings that map to them.

Return type:

dict

Examples

>>> strings_before = ['Jack Fruit 88.', "Jack!Fruit!88!", "PINEAPPLES"]
>>> strings_after = ["jack_fruit_88_", "jack_fruit_88_", "pineapples"]
>>> resulting_duplicates(strings_before, strings_after)
{'jack_fruit_88_': ['Jack Fruit 88.', 'Jack!Fruit!88!']}
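
One way to produce this mapping is to group the original strings by their standardized form and keep only the groups with more than one member. The sketch below is an assumption about how this could be done, not the package's implementation; the name `resulting_duplicates_sketch` is hypothetical, and the order in which the type and length checks run is a guess.

```python
from collections import defaultdict


def resulting_duplicates_sketch(original_strings, standardized_strings):
    """Hypothetical re-implementation, for illustration only."""
    # Documented TypeError: both inputs must be lists of strings.
    for name, value in (
        ("original_strings", original_strings),
        ("standardized_strings", standardized_strings),
    ):
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            raise TypeError(f"{name} must be a list of strings")
    # Documented ValueError: the lists must pair up one-to-one.
    if len(original_strings) != len(standardized_strings):
        raise ValueError("inputs must be the same length")

    # Group each original string under its standardized form.
    groups = defaultdict(list)
    for original, standardized in zip(original_strings, standardized_strings):
        groups[standardized].append(original)

    # Keep only standardized strings that more than one original maps to.
    return {key: values for key, values in groups.items() if len(values) > 1}
```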

wrangle_in_py.column_name_standardizer.column_name_standardizer(df)[source]

Returns a copy of the inputted dataframe with standardized column names. Column names will be converted to lowercase and non-alphanumerics (including spaces and punctuation) will be replaced with underscores.

If the standardization results in duplicate column names, a UserWarning is issued.

Parameters: df (pandas.DataFrame) – The input pandas DataFrame whose column names need standardization.

Warning

UserWarning – If any of the standardized column names are the same.

Raises: TypeError – If the input dataframe is not a pandas DataFrame.
Returns: A new DataFrame with standardized column names.
Return type: pandas.DataFrame

Examples

>>> import pandas as pd
>>> data = {'Jack Fruit 88': [1, 2], 'PINEAPPLES': [3, 4], 'Dragon (Fruit)': [25, 30]}
>>> df = pd.DataFrame(data)
>>> column_name_standardizer(df)
   jack_fruit_88  pineapples  dragon__fruit_
0              1           3              25
1              2           4              30
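
The duplicate-name warning can be illustrated without pandas by working on a plain list of column names. This is a rough sketch under stated assumptions: `standardize_names_sketch` is a hypothetical helper, the warning text is invented, the names are assumed to be strings, and "alphanumeric" is taken to mean ASCII a-z and 0-9. It also shows how a caller might capture the warning with the standard `warnings` module.

```python
import re
import warnings


def standardize_names_sketch(names):
    """Hypothetical helper: standardize names and warn on collisions."""
    # Assumption: ASCII-only notion of "alphanumeric".
    standardized = [re.sub(r"[^a-z0-9]", "_", name.lower()) for name in names]

    # Collect the original names behind each standardized name.
    groups = {}
    for original, new in zip(names, standardized):
        groups.setdefault(new, []).append(original)
    collisions = {new: olds for new, olds in groups.items() if len(olds) > 1}

    # Issue (not raise) a UserWarning so the caller still gets a result.
    if collisions:
        warnings.warn(
            f"Standardization produced duplicate column names: {collisions}",
            UserWarning,
        )
    return standardized


# Capture the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_names = standardize_names_sketch(
        ["Jack Fruit 88.", "Jack!Fruit!88!", "PINEAPPLES"]
    )
```

Because a warning is issued instead of an exception, the standardized names are still returned; it is left to the caller to decide whether duplicate column names are acceptable.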