{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage of `wrangle_in_py` in a project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial you'll learn how to use some functions that help with some common data wrangling and tidying tasks along with the pandas python package. This will prepare the dataframe for any future analysis you might be interested in including exploratory data analysis, formal progress reports, machine learning, etc.\n", "For this tutorial we will create our own messy dataframe to work with as an exmaple. In order to keep it simple it has just 18 rows of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To walk through the process of using our wrangle_in_py package we'll introduce Susie, who has opened her first ice cream shop and is eager to get going and eventually expand into more locations. However, before she gets too ahead of herself she's trying to collect some simple data about her employees from her shop that she's hoping to use later on. But for now it looks pretty messy so let's walk through cleaning up the dataframe with her!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1.0\n" ] } ], "source": [ "import pandas as pd\n", "import wrangle_in_py\n", "\n", "print(wrangle_in_py.__version__)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin we'll load Susie's dataframe in. We can see she's already got quite a few employees, their role in the shop and their shift start and end times." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_dict = {\n", " \"Shift Time\": [\n", " \"2025-01-01 09:00:00\", \"2025-01-01 17:00:00\", \"2025-01-04 01:00:00\", \n", " \"2025-01-02 09:00:00\", \"2025-01-02 17:00:00\", \"2025-01-03 01:00:00\", \n", " \"2025-01-05 09:00:00\", \"2025-01-03 17:00:00\", \"2025-01-04 01:00:00\", \n", " \"2025-01-04 09:00:00\", \"2025-01-04 17:00:00\", \"2025-01-05 01:00:00\", \n", " \"2025-01-05 09:00:00\", \"2025-01-05 17:00:00\", \"2025-01-06 01:00:00\",\n", " \"2025-01-06 01:00:00\", \"2025-01-06 01:00:00\", \"2025-01-06 01:00:00\"\n", " ],\n", " \"Employee ID\": [101, 102, 103, 104, 105, 106, 107, 102, 103, 108, 109, 110, 101, 111, 112, 112, 112, 112],\n", " \"Shop ID\": [1] * 18,\n", " \"Task Assigned\": [\n", " \"Inventory\", \"Cashier\", None, \"Cleaning\", \"Cashier\", \"Stocking\", \n", " \"Inventory\", None, \"Cleaning\", \"Cashier\", \"Stocking\", \"Inventory\", \n", " \"Cashier\", None, \"Cleaning\", \"Cleaning\", \"Cleaning\", \"Cleaning\"\n", " ]\n", "}\n", "\n", "\n", "# Convert the dictionary into a pandas DataFrame\n", "df_from_dict = pd.DataFrame(data_dict)\n", "\n", "# Ensure the \"Shift Time\" column is parsed as datetime\n", "df_from_dict[\"Shift Time\"] = pd.to_datetime(df_from_dict[\"Shift Time\"])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! Now let's clean up this dataframe!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Renaming Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First off, Susie noticed that the column names of her data might cause issues down the line in her analysis since they include spaces. She also thinks it might be a good idea to have all of the column names in lowercase too.\n", "\n", "At first, she considers changing them all of the column names manually, but it would be an annoying task that is prone to errors. For instance, she may rename the \"Employee ID\" column to \"shop_id\" by accident!\n", "\n", "No worries! `column_name_standardizer` will come to the rescue! After taking in a pandas DataFrame as an input, it will return a copy of the dataframe with all of the column names converted into lowercase and spaces or any non-numerics converted into into underscores!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the dataframe before the column names are converted:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Shift TimeEmployee IDShop IDTask Assigned
02025-01-01 09:00:001011Inventory
12025-01-01 17:00:001021Cashier
22025-01-04 01:00:001031None
32025-01-02 09:00:001041Cleaning
42025-01-02 17:00:001051Cashier
52025-01-03 01:00:001061Stocking
62025-01-05 09:00:001071Inventory
72025-01-03 17:00:001021None
82025-01-04 01:00:001031Cleaning
92025-01-04 09:00:001081Cashier
102025-01-04 17:00:001091Stocking
112025-01-05 01:00:001101Inventory
122025-01-05 09:00:001011Cashier
132025-01-05 17:00:001111None
142025-01-06 01:00:001121Cleaning
152025-01-06 01:00:001121Cleaning
162025-01-06 01:00:001121Cleaning
172025-01-06 01:00:001121Cleaning
\n", "
" ], "text/plain": [ " Shift Time Employee ID Shop ID Task Assigned\n", "0 2025-01-01 09:00:00 101 1 Inventory\n", "1 2025-01-01 17:00:00 102 1 Cashier\n", "2 2025-01-04 01:00:00 103 1 None\n", "3 2025-01-02 09:00:00 104 1 Cleaning\n", "4 2025-01-02 17:00:00 105 1 Cashier\n", "5 2025-01-03 01:00:00 106 1 Stocking\n", "6 2025-01-05 09:00:00 107 1 Inventory\n", "7 2025-01-03 17:00:00 102 1 None\n", "8 2025-01-04 01:00:00 103 1 Cleaning\n", "9 2025-01-04 09:00:00 108 1 Cashier\n", "10 2025-01-04 17:00:00 109 1 Stocking\n", "11 2025-01-05 01:00:00 110 1 Inventory\n", "12 2025-01-05 09:00:00 101 1 Cashier\n", "13 2025-01-05 17:00:00 111 1 None\n", "14 2025-01-06 01:00:00 112 1 Cleaning\n", "15 2025-01-06 01:00:00 112 1 Cleaning\n", "16 2025-01-06 01:00:00 112 1 Cleaning\n", "17 2025-01-06 01:00:00 112 1 Cleaning" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_from_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is after:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_idshop_idtask_assigned
02025-01-01 09:00:001011Inventory
12025-01-01 17:00:001021Cashier
22025-01-04 01:00:001031None
32025-01-02 09:00:001041Cleaning
42025-01-02 17:00:001051Cashier
52025-01-03 01:00:001061Stocking
62025-01-05 09:00:001071Inventory
72025-01-03 17:00:001021None
82025-01-04 01:00:001031Cleaning
92025-01-04 09:00:001081Cashier
102025-01-04 17:00:001091Stocking
112025-01-05 01:00:001101Inventory
122025-01-05 09:00:001011Cashier
132025-01-05 17:00:001111None
142025-01-06 01:00:001121Cleaning
152025-01-06 01:00:001121Cleaning
162025-01-06 01:00:001121Cleaning
172025-01-06 01:00:001121Cleaning
\n", "
" ], "text/plain": [ " shift_time employee_id shop_id task_assigned\n", "0 2025-01-01 09:00:00 101 1 Inventory\n", "1 2025-01-01 17:00:00 102 1 Cashier\n", "2 2025-01-04 01:00:00 103 1 None\n", "3 2025-01-02 09:00:00 104 1 Cleaning\n", "4 2025-01-02 17:00:00 105 1 Cashier\n", "5 2025-01-03 01:00:00 106 1 Stocking\n", "6 2025-01-05 09:00:00 107 1 Inventory\n", "7 2025-01-03 17:00:00 102 1 None\n", "8 2025-01-04 01:00:00 103 1 Cleaning\n", "9 2025-01-04 09:00:00 108 1 Cashier\n", "10 2025-01-04 17:00:00 109 1 Stocking\n", "11 2025-01-05 01:00:00 110 1 Inventory\n", "12 2025-01-05 09:00:00 101 1 Cashier\n", "13 2025-01-05 17:00:00 111 1 None\n", "14 2025-01-06 01:00:00 112 1 Cleaning\n", "15 2025-01-06 01:00:00 112 1 Cleaning\n", "16 2025-01-06 01:00:00 112 1 Cleaning\n", "17 2025-01-06 01:00:00 112 1 Cleaning" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from wrangle_in_py.column_name_standardizer import column_name_standardizer\n", "\n", "new_df = column_name_standardizer(df_from_dict)\n", "\n", "new_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the column names have been converted from \"Shift Time\", \"Employee ID\", \"Shop ID\", and \"Task Assigned\" to \"shift_time\", \"employee_id\", \"shop_id\", and \"task_assigned\". Wonderful! Now Susie is a bit more ready for any analysis she wants to do with the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Removing duplicate rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will explore the `remove_duplicates` function from the `wrangle_in_py` package. \n", "\n", "This function is designed to help clean up messy datasets by identifying and removing duplicate rows based on specified columns. Duplicate rows in datasets can lead to skewed results and inaccurate analysis, particularly when analyzing employee schedules and tasks, so it's crucial to handle them effectively as part of your data cleaning workflow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Susie, a scheduling manager, needs to analyze staffing patterns across shifts to optimize scheduling. Her dataset contains information on employee shifts, but duplicate entries could distort her analysis. Using `remove_duplicates`, Susie can streamline her data by removing rows with the same Shift Time and Employee ID, keeping only the first occurrence of each duplicate.\n", "\n", "Notice that several rows with duplicate \"Shift Time\" and \"Employee ID\" values exist at the bottom of the dataset. This redundancy could lead to incorrect staffing insights." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_idshop_idtask_assigned
02025-01-01 09:00:001011Inventory
12025-01-01 17:00:001021Cashier
22025-01-04 01:00:001031None
32025-01-02 09:00:001041Cleaning
42025-01-02 17:00:001051Cashier
52025-01-03 01:00:001061Stocking
62025-01-05 09:00:001071Inventory
72025-01-03 17:00:001021None
82025-01-04 01:00:001031Cleaning
92025-01-04 09:00:001081Cashier
102025-01-04 17:00:001091Stocking
112025-01-05 01:00:001101Inventory
122025-01-05 09:00:001011Cashier
132025-01-05 17:00:001111None
142025-01-06 01:00:001121Cleaning
152025-01-06 01:00:001121Cleaning
162025-01-06 01:00:001121Cleaning
172025-01-06 01:00:001121Cleaning
\n", "
" ], "text/plain": [ " shift_time employee_id shop_id task_assigned\n", "0 2025-01-01 09:00:00 101 1 Inventory\n", "1 2025-01-01 17:00:00 102 1 Cashier\n", "2 2025-01-04 01:00:00 103 1 None\n", "3 2025-01-02 09:00:00 104 1 Cleaning\n", "4 2025-01-02 17:00:00 105 1 Cashier\n", "5 2025-01-03 01:00:00 106 1 Stocking\n", "6 2025-01-05 09:00:00 107 1 Inventory\n", "7 2025-01-03 17:00:00 102 1 None\n", "8 2025-01-04 01:00:00 103 1 Cleaning\n", "9 2025-01-04 09:00:00 108 1 Cashier\n", "10 2025-01-04 17:00:00 109 1 Stocking\n", "11 2025-01-05 01:00:00 110 1 Inventory\n", "12 2025-01-05 09:00:00 101 1 Cashier\n", "13 2025-01-05 17:00:00 111 1 None\n", "14 2025-01-06 01:00:00 112 1 Cleaning\n", "15 2025-01-06 01:00:00 112 1 Cleaning\n", "16 2025-01-06 01:00:00 112 1 Cleaning\n", "17 2025-01-06 01:00:00 112 1 Cleaning" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 rows have been dropped.\n" ] } ], "source": [ "from wrangle_in_py.remove_duplicates import remove_duplicates\n", "\n", "no_duplicates_df = remove_duplicates(new_df, subset_columns=[\"shift_time\", \"employee_id\"], keep=\"first\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_idshop_idtask_assigned
02025-01-01 09:00:001011Inventory
12025-01-01 17:00:001021Cashier
22025-01-04 01:00:001031None
32025-01-02 09:00:001041Cleaning
42025-01-02 17:00:001051Cashier
52025-01-03 01:00:001061Stocking
62025-01-05 09:00:001071Inventory
72025-01-03 17:00:001021None
92025-01-04 09:00:001081Cashier
102025-01-04 17:00:001091Stocking
112025-01-05 01:00:001101Inventory
122025-01-05 09:00:001011Cashier
132025-01-05 17:00:001111None
142025-01-06 01:00:001121Cleaning
\n", "
" ], "text/plain": [ " shift_time employee_id shop_id task_assigned\n", "0 2025-01-01 09:00:00 101 1 Inventory\n", "1 2025-01-01 17:00:00 102 1 Cashier\n", "2 2025-01-04 01:00:00 103 1 None\n", "3 2025-01-02 09:00:00 104 1 Cleaning\n", "4 2025-01-02 17:00:00 105 1 Cashier\n", "5 2025-01-03 01:00:00 106 1 Stocking\n", "6 2025-01-05 09:00:00 107 1 Inventory\n", "7 2025-01-03 17:00:00 102 1 None\n", "9 2025-01-04 09:00:00 108 1 Cashier\n", "10 2025-01-04 17:00:00 109 1 Stocking\n", "11 2025-01-05 01:00:00 110 1 Inventory\n", "12 2025-01-05 09:00:00 101 1 Cashier\n", "13 2025-01-05 17:00:00 111 1 None\n", "14 2025-01-06 01:00:00 112 1 Cleaning" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_duplicates_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, 4 duplicate rows with the same 'Shift Time' and 'Employee ID' were removed. The 'keep=\"first\"' parameter ensures that the first occurrence of a duplicate is retained while subsequent duplicates are dropped. This ensures that Susie's data is clean and ready for analysis. Susie’s dataset is now free of duplicate rows, enabling her to generate insights into staffing patterns and scheduling needs without concerns about redundant data entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Removing columns based on missing data and/or low coefficient of variance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next Susie is wanting to simplify things for herself later on and wants to remove any columns that aren't useful for her. She's decided that any column missing more than 10% of data should be removed since it won't be useful for her later when she's deciding how many employees are needed in a new shop. She always wants to remove any columns that have redundant info, aka the same values for all the cells, since it's not adding anything to her dataset currently and she finds it distracting to have columns included if they don't have a purpose. \n", "\n", "Thus Susie wants to remove the columns \"task_assigned\" (for missing data), and \"shop_id\" (for redundant info). She could do this by examining her dataframe, noting down the columns to remove, then deleting the columns with the pandas .drop() function. But there's a better way! The function column_drop_threshold in this package will calculate the proportion of missing data in all the columns and remove the columns that have too much missing data, based on a threshold specified by Susie (or you, the user)! The function also has an optional argument which when included will delete any columns that have a lower coefficient of variance than specified. In this way if there's a column with the same value stored for every example we can remove it without having to examine the dataframe to identify columns that fit this description." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As such Susie has decided to remove any columns with more than 10% missing data, and any columns with a coefficient of variance lower than 0.001." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from wrangle_in_py.column_drop_threshold import column_drop_threshold\n", "\n", "# Note that the default for variance is None, so we don't have to include it if we don't want columns removed on that basis\n", "no_missing_df = column_drop_threshold(no_duplicates_df, threshold=0.1, variance=0.001)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_id
02025-01-01 09:00:00101
12025-01-01 17:00:00102
22025-01-04 01:00:00103
32025-01-02 09:00:00104
42025-01-02 17:00:00105
52025-01-03 01:00:00106
62025-01-05 09:00:00107
72025-01-03 17:00:00102
92025-01-04 09:00:00108
102025-01-04 17:00:00109
112025-01-05 01:00:00110
122025-01-05 09:00:00101
132025-01-05 17:00:00111
142025-01-06 01:00:00112
\n", "
" ], "text/plain": [ " shift_time employee_id\n", "0 2025-01-01 09:00:00 101\n", "1 2025-01-01 17:00:00 102\n", "2 2025-01-04 01:00:00 103\n", "3 2025-01-02 09:00:00 104\n", "4 2025-01-02 17:00:00 105\n", "5 2025-01-03 01:00:00 106\n", "6 2025-01-05 09:00:00 107\n", "7 2025-01-03 17:00:00 102\n", "9 2025-01-04 09:00:00 108\n", "10 2025-01-04 17:00:00 109\n", "11 2025-01-05 01:00:00 110\n", "12 2025-01-05 09:00:00 101\n", "13 2025-01-05 17:00:00 111\n", "14 2025-01-06 01:00:00 112" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_missing_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wonderful! Now we can see the columns we wanted removed are gone, and rather than going through the entire dataframe manually calculating missing amounts of data and coefficients of variance, the function did it for us and saved us some time! Now Susie can spend her free time finding great locations for a new shop and training more employees!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Separating the datetime column into different parts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now Susie decides she wants to have the date column extracted into different columns, so that year, month, day, etc. are all in separate columns. This way she feels free to remove info that is irrelevant to the specific data analysis she wants to do, for example sometimes she may want only month and day in order to see trends over the months, whereas in other cases she may want to focus on the minutes and hours in order to understand daily staffing needs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `extracting_ymd` and `extracting_hms` function will allow Susie to create three more columns (`extracting_ymd` creates year, month, and day, whereas `extracting_hms` creates hour, minute, and second respectively) from the specified datetime column. There are two compulsory arguments. The first argument, `df`, corresponds to the dataframe containing the datetime column. The second argument is `column` indicating the name of datetime column being extracted. The output returns a copy of the input DataFrame with three new columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting year, month, day\n", "To start with, the dataframe `no_missing_df` from above is inputted and `shift_time` is the datetime column to be extracted." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_idshift_time_yearshift_time_monthshift_time_day
02025-01-01 09:00:00101202511
12025-01-01 17:00:00102202511
22025-01-04 01:00:00103202514
32025-01-02 09:00:00104202512
42025-01-02 17:00:00105202512
52025-01-03 01:00:00106202513
62025-01-05 09:00:00107202515
72025-01-03 17:00:00102202513
92025-01-04 09:00:00108202514
102025-01-04 17:00:00109202514
112025-01-05 01:00:00110202515
122025-01-05 09:00:00101202515
132025-01-05 17:00:00111202515
142025-01-06 01:00:00112202516
\n", "
" ], "text/plain": [ " shift_time employee_id shift_time_year shift_time_month \\\n", "0 2025-01-01 09:00:00 101 2025 1 \n", "1 2025-01-01 17:00:00 102 2025 1 \n", "2 2025-01-04 01:00:00 103 2025 1 \n", "3 2025-01-02 09:00:00 104 2025 1 \n", "4 2025-01-02 17:00:00 105 2025 1 \n", "5 2025-01-03 01:00:00 106 2025 1 \n", "6 2025-01-05 09:00:00 107 2025 1 \n", "7 2025-01-03 17:00:00 102 2025 1 \n", "9 2025-01-04 09:00:00 108 2025 1 \n", "10 2025-01-04 17:00:00 109 2025 1 \n", "11 2025-01-05 01:00:00 110 2025 1 \n", "12 2025-01-05 09:00:00 101 2025 1 \n", "13 2025-01-05 17:00:00 111 2025 1 \n", "14 2025-01-06 01:00:00 112 2025 1 \n", "\n", " shift_time_day \n", "0 1 \n", "1 1 \n", "2 4 \n", "3 2 \n", "4 2 \n", "5 3 \n", "6 5 \n", "7 3 \n", "9 4 \n", "10 4 \n", "11 5 \n", "12 5 \n", "13 5 \n", "14 6 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from wrangle_in_py.extracting_ymd_hms import extracting_ymd\n", "\n", "clean_date_df = extracting_ymd(no_missing_df, \"shift_time\")\n", "\n", "clean_date_df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `shift_time` is extracted and three columns, `shift_time_year`, `shift_time_month`, and `shift_time_day` are added." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting hour, minute, second\n", "After that, the dataframe `clean_date_df` is being further processed and `shift_time` is the datetime column to be extracted again." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
shift_timeemployee_idshift_time_yearshift_time_monthshift_time_dayshift_time_hourshift_time_minuteshift_time_second
02025-01-01 09:00:00101202511900
12025-01-01 17:00:001022025111700
22025-01-04 01:00:00103202514100
32025-01-02 09:00:00104202512900
42025-01-02 17:00:001052025121700
52025-01-03 01:00:00106202513100
62025-01-05 09:00:00107202515900
72025-01-03 17:00:001022025131700
92025-01-04 09:00:00108202514900
102025-01-04 17:00:001092025141700
112025-01-05 01:00:00110202515100
122025-01-05 09:00:00101202515900
132025-01-05 17:00:001112025151700
142025-01-06 01:00:00112202516100
\n", "
" ], "text/plain": [ " shift_time employee_id shift_time_year shift_time_month \\\n", "0 2025-01-01 09:00:00 101 2025 1 \n", "1 2025-01-01 17:00:00 102 2025 1 \n", "2 2025-01-04 01:00:00 103 2025 1 \n", "3 2025-01-02 09:00:00 104 2025 1 \n", "4 2025-01-02 17:00:00 105 2025 1 \n", "5 2025-01-03 01:00:00 106 2025 1 \n", "6 2025-01-05 09:00:00 107 2025 1 \n", "7 2025-01-03 17:00:00 102 2025 1 \n", "9 2025-01-04 09:00:00 108 2025 1 \n", "10 2025-01-04 17:00:00 109 2025 1 \n", "11 2025-01-05 01:00:00 110 2025 1 \n", "12 2025-01-05 09:00:00 101 2025 1 \n", "13 2025-01-05 17:00:00 111 2025 1 \n", "14 2025-01-06 01:00:00 112 2025 1 \n", "\n", " shift_time_day shift_time_hour shift_time_minute shift_time_second \n", "0 1 9 0 0 \n", "1 1 17 0 0 \n", "2 4 1 0 0 \n", "3 2 9 0 0 \n", "4 2 17 0 0 \n", "5 3 1 0 0 \n", "6 5 9 0 0 \n", "7 3 17 0 0 \n", "9 4 9 0 0 \n", "10 4 17 0 0 \n", "11 5 1 0 0 \n", "12 5 9 0 0 \n", "13 5 17 0 0 \n", "14 6 1 0 0 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from wrangle_in_py.extracting_ymd_hms import extracting_hms\n", "\n", "clean_datetime_df = extracting_hms(clean_date_df, \"shift_time\")\n", "\n", "clean_datetime_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `shift_time` is further extracted and three columns, `shift_time_hour`, `shift_time_minute`, and `shift_time_second` are added." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now her dataset has column names that are easy to work with, no duplicated rows, no columns with missing data or redundant info, and individual columns for date and time info Susie can go ahead and analyze her staffing needs and begin planning for her next location!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ends our tutorial on the wrangle_in_py python package. If you require more functions relating to dataframes you can find more info on built-in functions in pandas by reading [pandas documentation](https://pandas.pydata.org/docs/)." ] } ], "metadata": { "kernelspec": { "display_name": "py_wrangle", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }