wrangle_in_py.remove_duplicates
Functions
remove_duplicates(df, subset_columns=None, keep='first') | Remove duplicate rows from a DataFrame based on specified columns.
Module Contents
- wrangle_in_py.remove_duplicates.remove_duplicates(df, subset_columns=None, keep='first')
Remove duplicate rows from a DataFrame based on specified columns.
- Parameters:
df (pd.DataFrame) – The dataframe to process.
subset_columns (list or None) – List of column names to consider for identifying duplicates. If None (default), consider all columns.
keep (str or bool) – Determines which of a set of duplicate rows to keep:
  - 'first': Keep the first occurrence (default).
  - 'last': Keep the last occurrence.
  - False: Drop every row that has a duplicate, keeping none of them.
- Raises:
ValueError – If df is not a pandas DataFrame, if any column in subset_columns is not a column of the input DataFrame, or if keep is not 'first', 'last', or False.
- Returns:
A DataFrame with duplicates removed.
- Return type:
pd.DataFrame
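The checks listed under Raises can be pictured with a minimal sketch. The function below, remove_duplicates_sketch, is a hypothetical stand-in written for illustration; it is not the library's source, and it assumes the real function delegates to pandas' DataFrame.drop_duplicates once its inputs are validated. The error messages are likewise assumptions.

```python
import pandas as pd


def remove_duplicates_sketch(df, subset_columns=None, keep="first"):
    """Hypothetical stand-in for remove_duplicates (not the library's code)."""
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a pandas DataFrame.")
    if subset_columns is not None:
        # Collect every requested column that df does not have.
        missing = [col for col in subset_columns if col not in df.columns]
        if missing:
            raise ValueError(f"Columns not found in df: {missing}")
    # Identity check on False so that 0 is not accepted in its place.
    if not (keep is False or keep in ("first", "last")):
        raise ValueError("keep must be 'first', 'last', or False.")
    # Assumption: the de-duplication itself is pandas' drop_duplicates.
    return df.drop_duplicates(subset=subset_columns, keep=keep)


df = pd.DataFrame({"A": [1, 2, 2, 4], "B": [5, 6, 6, 8]})
result = remove_duplicates_sketch(df, subset_columns=["A"])
print(result.index.tolist())  # [0, 1, 3]

try:
    remove_duplicates_sketch(df, subset_columns=["C"])
except ValueError as err:
    print(err)  # Columns not found in df: ['C']
```

Validating before calling pandas keeps all three failure modes on a single exception type, which is what the Raises entry describes.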
Example
>>> import pandas as pd
>>> from wrangle_in_py.remove_duplicates import remove_duplicates
>>> data = {'A': [1, 2, 2, 4], 'B': [5, 6, 6, 8]}
>>> df = pd.DataFrame(data)
>>> remove_duplicates(df, subset_columns=['A'])
   A  B
0  1  5
1  2  6
3  4  8
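To see how the three keep options differ on the same data, the sketch below calls pandas' DataFrame.drop_duplicates directly. This assumes remove_duplicates follows pandas' semantics for keep, which is consistent with the example output above (the original index labels 0, 1, 3 are preserved) but is not confirmed by this page.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 4], "B": [5, 6, 6, 8]})

# Rows 1 and 2 share A == 2, so they are the only duplicates on column A.
kept_first = df.drop_duplicates(subset=["A"], keep="first")
kept_last = df.drop_duplicates(subset=["A"], keep="last")
kept_none = df.drop_duplicates(subset=["A"], keep=False)

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [0, 2, 3]
print(kept_none.index.tolist())   # [0, 3]
```

Under the same assumption, the result keeps the original index labels, so a consecutive index requires calling reset_index(drop=True) on the returned DataFrame.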