wrangle_in_py.column_name_standardizer
Functions
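- string_standardizer(messy_string): Converts the input messy_string to lowercase and replaces non-alphanumeric characters with underscores.
- resulting_duplicates(original_strings, standardized_strings): Identifies which strings became duplicates after standardization.
- column_name_standardizer(df): Returns a copy of the input DataFrame with standardized column names.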
Module Contents
- wrangle_in_py.column_name_standardizer.string_standardizer(messy_string)[source]
Converts the input messy_string to lowercase and replaces non-alphanumeric characters (including spaces and punctuation) with underscores.
- Parameters:
messy_string (str) – The input string to be standardized.
- Raises:
TypeError – If the input messy_string is not a string.
- Returns:
A standardized version of the input string, in lowercase and with non-alphanumeric characters replaced by underscores.
- Return type:
str
Examples
>>> string_standardizer('Jack Fruit 88')
'jack_fruit_88'
>>> string_standardizer('PINEAPPLES')
'pineapples'
>>> string_standardizer('Dragon (Fruit)')
'dragon__fruit_'
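The behavior described above can be approximated with a single regular-expression substitution. The following is a minimal sketch, not the package's actual source: the name string_standardizer_sketch is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits only, which the documentation does not state.

```python
import re


def string_standardizer_sketch(messy_string):
    """Hypothetical re-implementation of the documented behavior."""
    if not isinstance(messy_string, str):
        raise TypeError("messy_string must be a string")
    # Lowercase first, then replace every character that is not an
    # ASCII letter or digit with an underscore (assumption: ASCII only).
    return re.sub(r"[^a-z0-9]", "_", messy_string.lower())


print(string_standardizer_sketch("Jack Fruit 88"))   # jack_fruit_88
print(string_standardizer_sketch("Dragon (Fruit)"))  # dragon__fruit_
```

Each replaced character yields its own underscore, which is why 'Dragon (Fruit)' produces a double underscore: the space and the opening parenthesis are substituted separately.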
- wrangle_in_py.column_name_standardizer.resulting_duplicates(original_strings, standardized_strings)[source]
Identifies which strings became duplicates after standardization.
- Parameters:
original_strings (list of str) – List of strings before standardization.
standardized_strings (list of str) – List of strings after standardization.
- Raises:
ValueError – If original_strings and standardized_strings are not the same length.
TypeError – If either original_strings or standardized_strings is not a list of strings.
- Returns:
A dictionary whose keys are the standardized strings that occur more than once, and whose values are lists of the original strings that map to them.
- Return type:
dict
Examples
>>> strings_before = ['Jack Fruit 88.', "Jack!Fruit!88!", "PINEAPPLES"]
>>> strings_after = ["jack_fruit_88_", "jack_fruit_88_", "pineapples"]
>>> resulting_duplicates(strings_before, strings_after)
{'jack_fruit_88_': ['Jack Fruit 88.', 'Jack!Fruit!88!']}
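One way to produce the documented mapping is to group the original strings by their standardized form and keep only the groups with more than one member. This is a sketch under that assumption; resulting_duplicates_sketch is a hypothetical name, and the order of the checks and the error messages are illustrative rather than taken from the package.

```python
from collections import defaultdict


def resulting_duplicates_sketch(original_strings, standardized_strings):
    """Hypothetical re-implementation of the documented behavior."""
    # Both inputs must be lists containing only strings.
    for value in (original_strings, standardized_strings):
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            raise TypeError("both inputs must be lists of strings")
    if len(original_strings) != len(standardized_strings):
        raise ValueError("inputs must be the same length")

    # Group each original string under its standardized form.
    groups = defaultdict(list)
    for original, standardized in zip(original_strings, standardized_strings):
        groups[standardized].append(original)

    # Keep only standardized strings that more than one original maps to.
    return {key: members for key, members in groups.items() if len(members) > 1}


before = ["Jack Fruit 88.", "Jack!Fruit!88!", "PINEAPPLES"]
after = ["jack_fruit_88_", "jack_fruit_88_", "pineapples"]
print(resulting_duplicates_sketch(before, after))
```

The two lists are paired positionally, so the function relies on standardized_strings[i] being the standardized form of original_strings[i]; this is why a length mismatch is treated as an error.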
- wrangle_in_py.column_name_standardizer.column_name_standardizer(df)[source]
Returns a copy of the input DataFrame with standardized column names. Column names are converted to lowercase, and non-alphanumeric characters (including spaces and punctuation) are replaced with underscores.
If the standardization results in duplicate column names, a UserWarning is issued.
- Parameters:
df (pandas DataFrame) – The input pandas DataFrame whose column names need standardization.
- Warns:
UserWarning – If any of the standardized column names are the same.
- Raises:
TypeError – If the input df is not a pandas DataFrame.
- Returns:
A new DataFrame with standardized column names.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> data = {'Jack Fruit 88': [1, 2], 'PINEAPPLES': [3, 4], 'Dragon (Fruit)': [25, 30]}
>>> df = pd.DataFrame(data)
>>> column_name_standardizer(df)
   jack_fruit_88  pineapples  dragon__fruit_
0              1           3              25
1              2           4              30
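Because duplicate names are reported through a warning rather than an exception, callers who want to react to them need to capture the warning explicitly. The sketch below illustrates that pattern with the standard warnings module. To stay dependency-free it operates on a plain list of labels instead of a DataFrame; standardize_labels_sketch is a hypothetical stand-in, and the warning text is an assumption, not the package's actual message.

```python
import re
import warnings
from collections import Counter


def standardize_labels_sketch(labels):
    """Hypothetical stand-in that mimics the documented warning behavior."""
    # Assumption: ASCII-only alphanumerics, as in the documented examples.
    standardized = [re.sub(r"[^a-z0-9]", "_", str(label).lower()) for label in labels]
    repeated = sorted(name for name, n in Counter(standardized).items() if n > 1)
    if repeated:
        # Issue a warning (execution continues) rather than raising an error.
        warnings.warn(f"Duplicate names after standardization: {repeated}", UserWarning)
    return standardized


# Record warnings instead of printing them, so the caller can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = standardize_labels_sketch(["Jack Fruit 88.", "Jack!Fruit!88!", "PINEAPPLES"])

print(result)
print([str(w.message) for w in caught])
```

The same catch_warnings pattern should apply when calling column_name_standardizer on a DataFrame, since a UserWarning issued through the standard warnings machinery is recorded regardless of which function issues it.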