{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example usage of `wrangle_in_py` in a project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial you'll learn how to use some functions that help with some common data wrangling and tidying tasks along with the pandas python package. This will prepare the dataframe for any future analysis you might be interested in including exploratory data analysis, formal progress reports, machine learning, etc.\n",
    "For this tutorial we will create our own messy dataframe to work with as an exmaple. In order to keep it simple it has just 18 rows of data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To walk through the process of using our wrangle_in_py package we'll introduce Susie, who has opened her first ice cream shop and is eager to get going and eventually expand into more locations. However, before she gets too ahead of herself she's trying to collect some simple data about her employees from her shop that she's hoping to use later on. But for now it looks pretty messy so let's walk through cleaning up the dataframe with her!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.1.0\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import wrangle_in_py\n",
    "\n",
    "print(wrangle_in_py.__version__)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To begin we'll load Susie's dataframe in. We can see she's already got quite a few employees, their role in the shop and their shift start and end times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dict = {\n",
    "    \"Shift Time\": [\n",
    "        \"2025-01-01 09:00:00\", \"2025-01-01 17:00:00\", \"2025-01-04 01:00:00\", \n",
    "        \"2025-01-02 09:00:00\", \"2025-01-02 17:00:00\", \"2025-01-03 01:00:00\", \n",
    "        \"2025-01-05 09:00:00\", \"2025-01-03 17:00:00\", \"2025-01-04 01:00:00\", \n",
    "        \"2025-01-04 09:00:00\", \"2025-01-04 17:00:00\", \"2025-01-05 01:00:00\", \n",
    "        \"2025-01-05 09:00:00\", \"2025-01-05 17:00:00\", \"2025-01-06 01:00:00\",\n",
    "        \"2025-01-06 01:00:00\", \"2025-01-06 01:00:00\", \"2025-01-06 01:00:00\"\n",
    "    ],\n",
    "    \"Employee ID\": [101, 102, 103, 104, 105, 106, 107, 102, 103, 108, 109, 110, 101, 111, 112, 112, 112, 112],\n",
    "    \"Shop ID\": [1] * 18,\n",
    "    \"Task Assigned\": [\n",
    "        \"Inventory\", \"Cashier\", None, \"Cleaning\", \"Cashier\", \"Stocking\", \n",
    "        \"Inventory\", None, \"Cleaning\", \"Cashier\", \"Stocking\", \"Inventory\", \n",
    "        \"Cashier\", None, \"Cleaning\", \"Cleaning\", \"Cleaning\", \"Cleaning\"\n",
    "    ]\n",
    "}\n",
    "\n",
    "\n",
    "# Convert the dictionary into a pandas DataFrame\n",
    "df_from_dict = pd.DataFrame(data_dict)\n",
    "\n",
    "# Ensure the \"Shift Time\" column is parsed as datetime\n",
    "df_from_dict[\"Shift Time\"] = pd.to_datetime(df_from_dict[\"Shift Time\"])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now let's clean up this dataframe!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Renaming Columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First off, Susie noticed that the column names of her data might cause issues down the line in her analysis since they include spaces. She also thinks it might be a good idea to have all of the column names in lowercase too.\n",
    "\n",
    "At first, she considers changing them all of the column names manually, but it would be an annoying task that is prone to errors. For instance, she may rename the \"Employee ID\" column to \"shop_id\" by accident!\n",
    "\n",
    "No worries! `column_name_standardizer` will come to the rescue! After taking in a pandas DataFrame as an input, it will return a copy of the dataframe with all of the column names converted into lowercase and spaces or any non-numerics converted into into underscores!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the dataframe before the column names are converted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Shift Time</th>\n",
       "      <th>Employee ID</th>\n",
       "      <th>Shop ID</th>\n",
       "      <th>Task Assigned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Shift Time  Employee ID  Shop ID Task Assigned\n",
       "0  2025-01-01 09:00:00          101        1     Inventory\n",
       "1  2025-01-01 17:00:00          102        1       Cashier\n",
       "2  2025-01-04 01:00:00          103        1          None\n",
       "3  2025-01-02 09:00:00          104        1      Cleaning\n",
       "4  2025-01-02 17:00:00          105        1       Cashier\n",
       "5  2025-01-03 01:00:00          106        1      Stocking\n",
       "6  2025-01-05 09:00:00          107        1     Inventory\n",
       "7  2025-01-03 17:00:00          102        1          None\n",
       "8  2025-01-04 01:00:00          103        1      Cleaning\n",
       "9  2025-01-04 09:00:00          108        1       Cashier\n",
       "10 2025-01-04 17:00:00          109        1      Stocking\n",
       "11 2025-01-05 01:00:00          110        1     Inventory\n",
       "12 2025-01-05 09:00:00          101        1       Cashier\n",
       "13 2025-01-05 17:00:00          111        1          None\n",
       "14 2025-01-06 01:00:00          112        1      Cleaning\n",
       "15 2025-01-06 01:00:00          112        1      Cleaning\n",
       "16 2025-01-06 01:00:00          112        1      Cleaning\n",
       "17 2025-01-06 01:00:00          112        1      Cleaning"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_from_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here is after:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "      <th>shop_id</th>\n",
       "      <th>task_assigned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id  shop_id task_assigned\n",
       "0  2025-01-01 09:00:00          101        1     Inventory\n",
       "1  2025-01-01 17:00:00          102        1       Cashier\n",
       "2  2025-01-04 01:00:00          103        1          None\n",
       "3  2025-01-02 09:00:00          104        1      Cleaning\n",
       "4  2025-01-02 17:00:00          105        1       Cashier\n",
       "5  2025-01-03 01:00:00          106        1      Stocking\n",
       "6  2025-01-05 09:00:00          107        1     Inventory\n",
       "7  2025-01-03 17:00:00          102        1          None\n",
       "8  2025-01-04 01:00:00          103        1      Cleaning\n",
       "9  2025-01-04 09:00:00          108        1       Cashier\n",
       "10 2025-01-04 17:00:00          109        1      Stocking\n",
       "11 2025-01-05 01:00:00          110        1     Inventory\n",
       "12 2025-01-05 09:00:00          101        1       Cashier\n",
       "13 2025-01-05 17:00:00          111        1          None\n",
       "14 2025-01-06 01:00:00          112        1      Cleaning\n",
       "15 2025-01-06 01:00:00          112        1      Cleaning\n",
       "16 2025-01-06 01:00:00          112        1      Cleaning\n",
       "17 2025-01-06 01:00:00          112        1      Cleaning"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from wrangle_in_py.column_name_standardizer import column_name_standardizer\n",
    "\n",
    "new_df = column_name_standardizer(df_from_dict)\n",
    "\n",
    "new_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the column names have been converted from \"Shift Time\", \"Employee ID\", \"Shop ID\", and \"Task Assigned\" to \"shift_time\", \"employee_id\", \"shop_id\", and \"task_assigned\". Wonderful! Now Susie is a bit more ready for any analysis she wants to do with the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Removing duplicate rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will explore the `remove_duplicates` function from the `wrangle_in_py` package. \n",
    "\n",
    "This function is designed to help clean up messy datasets by identifying and removing duplicate rows based on specified columns. Duplicate rows in datasets can lead to skewed results and inaccurate analysis, particularly when analyzing employee schedules and tasks, so it's crucial to handle them effectively as part of your data cleaning workflow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Susie, a scheduling manager, needs to analyze staffing patterns across shifts to optimize scheduling. Her dataset contains information on employee shifts, but duplicate entries could distort her analysis. Using `remove_duplicates`, Susie can streamline her data by removing rows with the same Shift Time and Employee ID, keeping only the first occurrence of each duplicate.\n",
    "\n",
    "Notice that several rows with duplicate \"Shift Time\" and \"Employee ID\" values exist at the bottom of the dataset. This redundancy could lead to incorrect staffing insights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "      <th>shop_id</th>\n",
       "      <th>task_assigned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id  shop_id task_assigned\n",
       "0  2025-01-01 09:00:00          101        1     Inventory\n",
       "1  2025-01-01 17:00:00          102        1       Cashier\n",
       "2  2025-01-04 01:00:00          103        1          None\n",
       "3  2025-01-02 09:00:00          104        1      Cleaning\n",
       "4  2025-01-02 17:00:00          105        1       Cashier\n",
       "5  2025-01-03 01:00:00          106        1      Stocking\n",
       "6  2025-01-05 09:00:00          107        1     Inventory\n",
       "7  2025-01-03 17:00:00          102        1          None\n",
       "8  2025-01-04 01:00:00          103        1      Cleaning\n",
       "9  2025-01-04 09:00:00          108        1       Cashier\n",
       "10 2025-01-04 17:00:00          109        1      Stocking\n",
       "11 2025-01-05 01:00:00          110        1     Inventory\n",
       "12 2025-01-05 09:00:00          101        1       Cashier\n",
       "13 2025-01-05 17:00:00          111        1          None\n",
       "14 2025-01-06 01:00:00          112        1      Cleaning\n",
       "15 2025-01-06 01:00:00          112        1      Cleaning\n",
       "16 2025-01-06 01:00:00          112        1      Cleaning\n",
       "17 2025-01-06 01:00:00          112        1      Cleaning"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4 rows have been dropped.\n"
     ]
    }
   ],
   "source": [
    "from wrangle_in_py.remove_duplicates import remove_duplicates\n",
    "\n",
    "no_duplicates_df = remove_duplicates(new_df, subset_columns=[\"shift_time\", \"employee_id\"], keep=\"first\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "      <th>shop_id</th>\n",
       "      <th>task_assigned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>1</td>\n",
       "      <td>Stocking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>1</td>\n",
       "      <td>Inventory</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>1</td>\n",
       "      <td>Cashier</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>1</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>1</td>\n",
       "      <td>Cleaning</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id  shop_id task_assigned\n",
       "0  2025-01-01 09:00:00          101        1     Inventory\n",
       "1  2025-01-01 17:00:00          102        1       Cashier\n",
       "2  2025-01-04 01:00:00          103        1          None\n",
       "3  2025-01-02 09:00:00          104        1      Cleaning\n",
       "4  2025-01-02 17:00:00          105        1       Cashier\n",
       "5  2025-01-03 01:00:00          106        1      Stocking\n",
       "6  2025-01-05 09:00:00          107        1     Inventory\n",
       "7  2025-01-03 17:00:00          102        1          None\n",
       "9  2025-01-04 09:00:00          108        1       Cashier\n",
       "10 2025-01-04 17:00:00          109        1      Stocking\n",
       "11 2025-01-05 01:00:00          110        1     Inventory\n",
       "12 2025-01-05 09:00:00          101        1       Cashier\n",
       "13 2025-01-05 17:00:00          111        1          None\n",
       "14 2025-01-06 01:00:00          112        1      Cleaning"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_duplicates_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, 4 duplicate rows with the same 'Shift Time' and 'Employee ID' were removed. The 'keep=\"first\"' parameter ensures that the first occurrence of a duplicate is retained while subsequent duplicates are dropped. This ensures that Susie's data is clean and ready for analysis. Susie’s dataset is now free of duplicate rows, enabling her to generate insights into staffing patterns and scheduling needs without concerns about redundant data entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Removing columns based on missing data and/or low coefficient of variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next Susie is wanting to simplify things for herself later on and wants to remove any columns that aren't useful for her. She's decided that any column missing more than 10% of data should be removed since it won't be useful for her later when she's deciding how many employees are needed in a new shop. She always wants to remove any columns that have redundant info, aka the same values for all the cells, since it's not adding anything to her dataset currently and she finds it distracting to have columns included if they don't have a purpose. \n",
    "\n",
    "Thus Susie wants to remove the columns \"task_assigned\" (for missing data), and \"shop_id\" (for redundant info). She could do this by examining her dataframe, noting down the columns to remove, then deleting the columns with the pandas .drop() function. But there's a better way! The function column_drop_threshold in this package will calculate the proportion of missing data in all the columns and remove the columns that have too much missing data, based on a threshold specified by Susie (or you, the user)! The function also has an optional argument which when included will delete any columns that have a lower coefficient of variance than specified. In this way if there's a column with the same value stored for every example we can remove it without having to examine the dataframe to identify columns that fit this description."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As such Susie has decided to remove any columns with more than 10% missing data, and any columns with a coefficient of variance lower than 0.001."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from wrangle_in_py.column_drop_threshold import column_drop_threshold\n",
    "\n",
    "# Note that the default for variance is None, so we don't have to include it if we don't want columns removed on that basis\n",
    "no_missing_df = column_drop_threshold(no_duplicates_df, threshold=0.1, variance=0.001)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id\n",
       "0  2025-01-01 09:00:00          101\n",
       "1  2025-01-01 17:00:00          102\n",
       "2  2025-01-04 01:00:00          103\n",
       "3  2025-01-02 09:00:00          104\n",
       "4  2025-01-02 17:00:00          105\n",
       "5  2025-01-03 01:00:00          106\n",
       "6  2025-01-05 09:00:00          107\n",
       "7  2025-01-03 17:00:00          102\n",
       "9  2025-01-04 09:00:00          108\n",
       "10 2025-01-04 17:00:00          109\n",
       "11 2025-01-05 01:00:00          110\n",
       "12 2025-01-05 09:00:00          101\n",
       "13 2025-01-05 17:00:00          111\n",
       "14 2025-01-06 01:00:00          112"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "no_missing_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wonderful! Now we can see the columns we wanted removed are gone, and rather than going through the entire dataframe manually calculating missing amounts of data and coefficients of variance, the function did it for us and saved us some time! Now Susie can spend her free time finding great locations for a new shop and training more employees!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Separating the datetime column into different parts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now Susie decides she wants to have the date column extracted into different columns, so that year, month, day, etc. are all in separate columns. This way she feels free to remove info that is irrelevant to the specific data analysis she wants to do, for example sometimes she may want only month and day in order to see trends over the months, whereas in other cases she may want to focus on the minutes and hours in order to understand daily staffing needs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `extracting_ymd` and `extracting_hms` function will allow Susie to create three more columns (`extracting_ymd` creates year, month, and day, whereas `extracting_hms` creates hour, minute, and second respectively) from the specified datetime column. There are two compulsory arguments. The first argument, `df`, corresponds to the dataframe containing the datetime column. The second argument is `column` indicating the name of datetime column being extracted. The output returns a copy of the input DataFrame with three new columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting year, month, day\n",
    "To start with, the dataframe `no_missing_df` from above is inputted and `shift_time` is the datetime column to be extracted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "      <th>shift_time_year</th>\n",
       "      <th>shift_time_month</th>\n",
       "      <th>shift_time_day</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id  shift_time_year  shift_time_month  \\\n",
       "0  2025-01-01 09:00:00          101             2025                 1   \n",
       "1  2025-01-01 17:00:00          102             2025                 1   \n",
       "2  2025-01-04 01:00:00          103             2025                 1   \n",
       "3  2025-01-02 09:00:00          104             2025                 1   \n",
       "4  2025-01-02 17:00:00          105             2025                 1   \n",
       "5  2025-01-03 01:00:00          106             2025                 1   \n",
       "6  2025-01-05 09:00:00          107             2025                 1   \n",
       "7  2025-01-03 17:00:00          102             2025                 1   \n",
       "9  2025-01-04 09:00:00          108             2025                 1   \n",
       "10 2025-01-04 17:00:00          109             2025                 1   \n",
       "11 2025-01-05 01:00:00          110             2025                 1   \n",
       "12 2025-01-05 09:00:00          101             2025                 1   \n",
       "13 2025-01-05 17:00:00          111             2025                 1   \n",
       "14 2025-01-06 01:00:00          112             2025                 1   \n",
       "\n",
       "    shift_time_day  \n",
       "0                1  \n",
       "1                1  \n",
       "2                4  \n",
       "3                2  \n",
       "4                2  \n",
       "5                3  \n",
       "6                5  \n",
       "7                3  \n",
       "9                4  \n",
       "10               4  \n",
       "11               5  \n",
       "12               5  \n",
       "13               5  \n",
       "14               6  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from wrangle_in_py.extracting_ymd_hms import extracting_ymd\n",
    "\n",
    "clean_date_df = extracting_ymd(no_missing_df, \"shift_time\")\n",
    "\n",
    "clean_date_df\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `shift_time` is extracted and three columns, `shift_time_year`, `shift_time_month`, and `shift_time_day` are added."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting hour, minute, second\n",
    "After that, the dataframe `clean_date_df` is being further processed and `shift_time` is the datetime column to be extracted again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shift_time</th>\n",
       "      <th>employee_id</th>\n",
       "      <th>shift_time_year</th>\n",
       "      <th>shift_time_month</th>\n",
       "      <th>shift_time_day</th>\n",
       "      <th>shift_time_hour</th>\n",
       "      <th>shift_time_minute</th>\n",
       "      <th>shift_time_second</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2025-01-01 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2025-01-01 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2025-01-04 01:00:00</td>\n",
       "      <td>103</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2025-01-02 09:00:00</td>\n",
       "      <td>104</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2025-01-02 17:00:00</td>\n",
       "      <td>105</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2025-01-03 01:00:00</td>\n",
       "      <td>106</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>107</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2025-01-03 17:00:00</td>\n",
       "      <td>102</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2025-01-04 09:00:00</td>\n",
       "      <td>108</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2025-01-04 17:00:00</td>\n",
       "      <td>109</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2025-01-05 01:00:00</td>\n",
       "      <td>110</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2025-01-05 09:00:00</td>\n",
       "      <td>101</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2025-01-05 17:00:00</td>\n",
       "      <td>111</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2025-01-06 01:00:00</td>\n",
       "      <td>112</td>\n",
       "      <td>2025</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shift_time  employee_id  shift_time_year  shift_time_month  \\\n",
       "0  2025-01-01 09:00:00          101             2025                 1   \n",
       "1  2025-01-01 17:00:00          102             2025                 1   \n",
       "2  2025-01-04 01:00:00          103             2025                 1   \n",
       "3  2025-01-02 09:00:00          104             2025                 1   \n",
       "4  2025-01-02 17:00:00          105             2025                 1   \n",
       "5  2025-01-03 01:00:00          106             2025                 1   \n",
       "6  2025-01-05 09:00:00          107             2025                 1   \n",
       "7  2025-01-03 17:00:00          102             2025                 1   \n",
       "9  2025-01-04 09:00:00          108             2025                 1   \n",
       "10 2025-01-04 17:00:00          109             2025                 1   \n",
       "11 2025-01-05 01:00:00          110             2025                 1   \n",
       "12 2025-01-05 09:00:00          101             2025                 1   \n",
       "13 2025-01-05 17:00:00          111             2025                 1   \n",
       "14 2025-01-06 01:00:00          112             2025                 1   \n",
       "\n",
       "    shift_time_day  shift_time_hour  shift_time_minute  shift_time_second  \n",
       "0                1                9                  0                  0  \n",
       "1                1               17                  0                  0  \n",
       "2                4                1                  0                  0  \n",
       "3                2                9                  0                  0  \n",
       "4                2               17                  0                  0  \n",
       "5                3                1                  0                  0  \n",
       "6                5                9                  0                  0  \n",
       "7                3               17                  0                  0  \n",
       "9                4                9                  0                  0  \n",
       "10               4               17                  0                  0  \n",
       "11               5                1                  0                  0  \n",
       "12               5                9                  0                  0  \n",
       "13               5               17                  0                  0  \n",
       "14               6                1                  0                  0  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from wrangle_in_py.extracting_ymd_hms import extracting_hms\n",
    "\n",
    "clean_datetime_df = extracting_hms(clean_date_df, \"shift_time\")\n",
    "\n",
    "clean_datetime_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `shift_time` is further extracted and three columns, `shift_time_hour`, `shift_time_minute`, and `shift_time_second` are added."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now her dataset has column names that are easy to work with, no duplicated rows, no columns with missing data or redundant info, and individual columns for date and time info Susie can go ahead and analyze her staffing needs and begin planning for her next location!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This ends our tutorial on the wrangle_in_py python package. If you require more functions relating to dataframes you can find more info on built-in functions in pandas by reading [pandas documentation](https://pandas.pydata.org/docs/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py_wrangle",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}