Example usage of wrangle_in_py in a project

In this tutorial you’ll learn how to use some functions that help with some common data wrangling and tidying tasks along with the pandas python package. This will prepare the dataframe for any future analysis you might be interested in including exploratory data analysis, formal progress reports, machine learning, etc. For this tutorial we will create our own messy dataframe to work with as an exmaple. In order to keep it simple it has just 18 rows of data.

To walk through the process of using our wrangle_in_py package we’ll introduce Susie, who has opened her first ice cream shop and is eager to get going and eventually expand into more locations. However, before she gets too ahead of herself she’s trying to collect some simple data about her employees from her shop that she’s hoping to use later on. But for now it looks pretty messy so let’s walk through cleaning up the dataframe with her!

import pandas as pd
import wrangle_in_py

print(wrangle_in_py.__version__)
0.0.4

To begin we’ll load Susie’s dataframe in. We can see she’s already got quite a few employees, their role in the shop and their shift start and end times.

data_dict = {
    "Shift Time": [
        "2025-01-01 09:00:00", "2025-01-01 17:00:00", "2025-01-04 01:00:00", 
        "2025-01-02 09:00:00", "2025-01-02 17:00:00", "2025-01-03 01:00:00", 
        "2025-01-05 09:00:00", "2025-01-03 17:00:00", "2025-01-04 01:00:00", 
        "2025-01-04 09:00:00", "2025-01-04 17:00:00", "2025-01-05 01:00:00", 
        "2025-01-05 09:00:00", "2025-01-05 17:00:00", "2025-01-06 01:00:00",
        "2025-01-06 01:00:00", "2025-01-06 01:00:00", "2025-01-06 01:00:00"
    ],
    "Employee ID": [101, 102, 103, 104, 105, 106, 107, 102, 103, 108, 109, 110, 101, 111, 112, 112, 112, 112],
    "Shop ID": [1] * 18,
    "Task Assigned": [
        "Inventory", "Cashier", None, "Cleaning", "Cashier", "Stocking", 
        "Inventory", None, "Cleaning", "Cashier", "Stocking", "Inventory", 
        "Cashier", None, "Cleaning", "Cleaning", "Cleaning", "Cleaning"
    ]
}


# Convert the dictionary into a pandas DataFrame
df_from_dict = pd.DataFrame(data_dict)

# Ensure the "Shift Time" column is parsed as datetime
df_from_dict["Shift Time"] = pd.to_datetime(df_from_dict["Shift Time"])

Great! Now let’s clean up this dataframe!

Step 1: Renaming Columns

First off, Susie noticed that the column names of her data might cause issues down the line in her analysis since they include spaces. She also thinks it might be a good idea to have all of the column names in lowercase too.

At first, she considers changing them all of the column names manually, but it would be an annoying task that is prone to errors. For instance, she may rename the “Employee ID” column to “shop_id” by accident!

No worries! column_name_standardizer will come to the rescue! After taking in a pandas DataFrame as an input, it will return a copy of the dataframe with all of the column names converted into lowercase and spaces or any non-numerics converted into into underscores!

Here is the dataframe before the column names are converted:

df_from_dict
Shift Time Employee ID Shop ID Task Assigned
0 2025-01-01 09:00:00 101 1 Inventory
1 2025-01-01 17:00:00 102 1 Cashier
2 2025-01-04 01:00:00 103 1 None
3 2025-01-02 09:00:00 104 1 Cleaning
4 2025-01-02 17:00:00 105 1 Cashier
5 2025-01-03 01:00:00 106 1 Stocking
6 2025-01-05 09:00:00 107 1 Inventory
7 2025-01-03 17:00:00 102 1 None
8 2025-01-04 01:00:00 103 1 Cleaning
9 2025-01-04 09:00:00 108 1 Cashier
10 2025-01-04 17:00:00 109 1 Stocking
11 2025-01-05 01:00:00 110 1 Inventory
12 2025-01-05 09:00:00 101 1 Cashier
13 2025-01-05 17:00:00 111 1 None
14 2025-01-06 01:00:00 112 1 Cleaning
15 2025-01-06 01:00:00 112 1 Cleaning
16 2025-01-06 01:00:00 112 1 Cleaning
17 2025-01-06 01:00:00 112 1 Cleaning

And here is after:

from wrangle_in_py.column_name_standardizer import column_name_standardizer

new_df = column_name_standardizer(df_from_dict)

new_df
shift_time employee_id shop_id task_assigned
0 2025-01-01 09:00:00 101 1 Inventory
1 2025-01-01 17:00:00 102 1 Cashier
2 2025-01-04 01:00:00 103 1 None
3 2025-01-02 09:00:00 104 1 Cleaning
4 2025-01-02 17:00:00 105 1 Cashier
5 2025-01-03 01:00:00 106 1 Stocking
6 2025-01-05 09:00:00 107 1 Inventory
7 2025-01-03 17:00:00 102 1 None
8 2025-01-04 01:00:00 103 1 Cleaning
9 2025-01-04 09:00:00 108 1 Cashier
10 2025-01-04 17:00:00 109 1 Stocking
11 2025-01-05 01:00:00 110 1 Inventory
12 2025-01-05 09:00:00 101 1 Cashier
13 2025-01-05 17:00:00 111 1 None
14 2025-01-06 01:00:00 112 1 Cleaning
15 2025-01-06 01:00:00 112 1 Cleaning
16 2025-01-06 01:00:00 112 1 Cleaning
17 2025-01-06 01:00:00 112 1 Cleaning

Now the column names have been converted from “Shift Time”, “Employee ID”, “Shop ID”, and “Task Assigned” to “shift_time”, “employee_id”, “shop_id”, and “task_assigned”. Wonderful! Now Susie is a bit more ready for any analysis she wants to do with the data.

Step 2 - Removing duplicate rows

We will explore the remove_duplicates function from the wrangle_in_py package.

This function is designed to help clean up messy datasets by identifying and removing duplicate rows based on specified columns. Duplicate rows in datasets can lead to skewed results and inaccurate analysis, particularly when analyzing employee schedules and tasks, so it’s crucial to handle them effectively as part of your data cleaning workflow.

Susie, a scheduling manager, needs to analyze staffing patterns across shifts to optimize scheduling. Her dataset contains information on employee shifts, but duplicate entries could distort her analysis. Using remove_duplicates, Susie can streamline her data by removing rows with the same Shift Time and Employee ID, keeping only the first occurrence of each duplicate.

Notice that several rows with duplicate “Shift Time” and “Employee ID” values exist at the bottom of the dataset. This redundancy could lead to incorrect staffing insights.

new_df
shift_time employee_id shop_id task_assigned
0 2025-01-01 09:00:00 101 1 Inventory
1 2025-01-01 17:00:00 102 1 Cashier
2 2025-01-04 01:00:00 103 1 None
3 2025-01-02 09:00:00 104 1 Cleaning
4 2025-01-02 17:00:00 105 1 Cashier
5 2025-01-03 01:00:00 106 1 Stocking
6 2025-01-05 09:00:00 107 1 Inventory
7 2025-01-03 17:00:00 102 1 None
8 2025-01-04 01:00:00 103 1 Cleaning
9 2025-01-04 09:00:00 108 1 Cashier
10 2025-01-04 17:00:00 109 1 Stocking
11 2025-01-05 01:00:00 110 1 Inventory
12 2025-01-05 09:00:00 101 1 Cashier
13 2025-01-05 17:00:00 111 1 None
14 2025-01-06 01:00:00 112 1 Cleaning
15 2025-01-06 01:00:00 112 1 Cleaning
16 2025-01-06 01:00:00 112 1 Cleaning
17 2025-01-06 01:00:00 112 1 Cleaning
from wrangle_in_py.remove_duplicates import remove_duplicates

no_duplicates_df = remove_duplicates(new_df, subset_columns=["shift_time", "employee_id"], keep="first")
4 rows have been dropped.
no_duplicates_df
shift_time employee_id shop_id task_assigned
0 2025-01-01 09:00:00 101 1 Inventory
1 2025-01-01 17:00:00 102 1 Cashier
2 2025-01-04 01:00:00 103 1 None
3 2025-01-02 09:00:00 104 1 Cleaning
4 2025-01-02 17:00:00 105 1 Cashier
5 2025-01-03 01:00:00 106 1 Stocking
6 2025-01-05 09:00:00 107 1 Inventory
7 2025-01-03 17:00:00 102 1 None
9 2025-01-04 09:00:00 108 1 Cashier
10 2025-01-04 17:00:00 109 1 Stocking
11 2025-01-05 01:00:00 110 1 Inventory
12 2025-01-05 09:00:00 101 1 Cashier
13 2025-01-05 17:00:00 111 1 None
14 2025-01-06 01:00:00 112 1 Cleaning

In this example, 4 duplicate rows with the same ‘Shift Time’ and ‘Employee ID’ were removed. The ‘keep=”first”’ parameter ensures that the first occurrence of a duplicate is retained while subsequent duplicates are dropped. This ensures that Susie’s data is clean and ready for analysis. Susie’s dataset is now free of duplicate rows, enabling her to generate insights into staffing patterns and scheduling needs without concerns about redundant data entries.

Step 3 - Removing columns based on missing data and/or low coefficient of variance

Next Susie is wanting to simplify things for herself later on and wants to remove any columns that aren’t useful for her. She’s decided that any column missing more than 10% of data should be removed since it won’t be useful for her later when she’s deciding how many employees are needed in a new shop. She always wants to remove any columns that have redundant info, aka the same values for all the cells, since it’s not adding anything to her dataset currently and she finds it distracting to have columns included if they don’t have a purpose.

Thus Susie wants to remove the columns “task_assigned” (for missing data), and “shop_id” (for redundant info). She could do this by examining her dataframe, noting down the columns to remove, then deleting the columns with the pandas .drop() function. But there’s a better way! The function column_drop_threshold in this package will calculate the proportion of missing data in all the columns and remove the columns that have too much missing data, based on a threshold specified by Susie (or you, the user)! The function also has an optional argument which when included will delete any columns that have a lower coefficient of variance than specified. In this way if there’s a column with the same value stored for every example we can remove it without having to examine the dataframe to identify columns that fit this description.

As such Susie has decided to remove any columns with more than 10% missing data, and any columns with a coefficient of variance lower than 0.001.

from wrangle_in_py.column_drop_threshold import column_drop_threshold

# Note that the default for variance is None, so we don't have to include it if we don't want columns removed on that basis
no_missing_df = column_drop_threshold(no_duplicates_df, threshold=0.1, variance=0.001)
no_missing_df
shift_time employee_id
0 2025-01-01 09:00:00 101
1 2025-01-01 17:00:00 102
2 2025-01-04 01:00:00 103
3 2025-01-02 09:00:00 104
4 2025-01-02 17:00:00 105
5 2025-01-03 01:00:00 106
6 2025-01-05 09:00:00 107
7 2025-01-03 17:00:00 102
9 2025-01-04 09:00:00 108
10 2025-01-04 17:00:00 109
11 2025-01-05 01:00:00 110
12 2025-01-05 09:00:00 101
13 2025-01-05 17:00:00 111
14 2025-01-06 01:00:00 112

Wonderful! Now we can see the columns we wanted removed are gone, and rather than going through the entire dataframe manually calculating missing amounts of data and coefficients of variance, the function did it for us and saved us some time! Now Susie can spend her free time finding great locations for a new shop and training more employees!

Step 4 - Separating the datetime column into different parts

Now Susie decides she wants to have the date column extracted into different columns, so that year, month, day, etc. are all in separate columns. This way she feels free to remove info that is irrelevant to the specific data analysis she wants to do, for example sometimes she may want only month and day in order to see trends over the months, whereas in other cases she may want to focus on the minutes and hours in order to understand daily staffing needs.

The extracting_ymd and extracting_hms function will allow Susie to create three more columns (extracting_ymd creates year, month, and day, whereas extracting_hms creates hour, minute, and second respectively) from the specified datetime column. There are two compulsory arguments. The first argument, df, corresponds to the dataframe containing the datetime column. The second argument is column indicating the name of datetime column being extracted. The output returns a copy of the input DataFrame with three new columns.

Extracting year, month, day

To start with, the dataframe no_missing_df from above is inputted and shift_time is the datetime column to be extracted.

from wrangle_in_py.extracting_ymd_hms import extracting_ymd

clean_date_df = extracting_ymd(no_missing_df, "shift_time")

clean_date_df
shift_time employee_id shift_time_year shift_time_month shift_time_day
0 2025-01-01 09:00:00 101 2025 1 1
1 2025-01-01 17:00:00 102 2025 1 1
2 2025-01-04 01:00:00 103 2025 1 4
3 2025-01-02 09:00:00 104 2025 1 2
4 2025-01-02 17:00:00 105 2025 1 2
5 2025-01-03 01:00:00 106 2025 1 3
6 2025-01-05 09:00:00 107 2025 1 5
7 2025-01-03 17:00:00 102 2025 1 3
9 2025-01-04 09:00:00 108 2025 1 4
10 2025-01-04 17:00:00 109 2025 1 4
11 2025-01-05 01:00:00 110 2025 1 5
12 2025-01-05 09:00:00 101 2025 1 5
13 2025-01-05 17:00:00 111 2025 1 5
14 2025-01-06 01:00:00 112 2025 1 6

The shift_time is extracted and three columns, shift_time_year, shift_time_month, and shift_time_day are added.

Extracting hour, minute, second

After that, the dataframe clean_date_df is being further processed and shift_time is the datetime column to be extracted again.

from wrangle_in_py.extracting_ymd_hms import extracting_hms

clean_datetime_df = extracting_hms(clean_date_df, "shift_time")

clean_datetime_df
shift_time employee_id shift_time_year shift_time_month shift_time_day shift_time_hour shift_time_minute shift_time_second
0 2025-01-01 09:00:00 101 2025 1 1 9 0 0
1 2025-01-01 17:00:00 102 2025 1 1 17 0 0
2 2025-01-04 01:00:00 103 2025 1 4 1 0 0
3 2025-01-02 09:00:00 104 2025 1 2 9 0 0
4 2025-01-02 17:00:00 105 2025 1 2 17 0 0
5 2025-01-03 01:00:00 106 2025 1 3 1 0 0
6 2025-01-05 09:00:00 107 2025 1 5 9 0 0
7 2025-01-03 17:00:00 102 2025 1 3 17 0 0
9 2025-01-04 09:00:00 108 2025 1 4 9 0 0
10 2025-01-04 17:00:00 109 2025 1 4 17 0 0
11 2025-01-05 01:00:00 110 2025 1 5 1 0 0
12 2025-01-05 09:00:00 101 2025 1 5 9 0 0
13 2025-01-05 17:00:00 111 2025 1 5 17 0 0
14 2025-01-06 01:00:00 112 2025 1 6 1 0 0

The shift_time is further extracted and three columns, shift_time_hour, shift_time_minute, and shift_time_second are added.

Now her dataset has column names that are easy to work with, no duplicated rows, no columns with missing data or redundant info, and individual columns for date and time info Susie can go ahead and analyze her staffing needs and begin planning for her next location!

This ends our tutorial on the wrangle_in_py python package. If you require more functions relating to dataframes you can find more info on built-in functions in pandas by reading pandas documentation.