Removing duplicates in Pandas is a common operation, especially when dealing with datasets where duplicate rows may exist. You can remove duplicates based on one or more columns or consider the entire row for duplicates. Here's how you can do it:
You can use the drop_duplicates() method to remove duplicate rows based on specific columns.
```python
import pandas as pd

# Example DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Removing duplicates based on the 'Name' column
df_no_duplicates = df.drop_duplicates(subset=['Name'])
print(df_no_duplicates)
```
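The subset argument also accepts several column names, in which case rows count as duplicates only when they match on every listed column. A short sketch using the same example DataFrame:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Rows are duplicates only when BOTH 'Name' and 'Age' match
df_no_duplicates = df.drop_duplicates(subset=['Name', 'Age'])
print(df_no_duplicates)
```

Here rows 2 and 4 are dropped because they repeat earlier (Name, Age) pairs, leaving the first Alice, the first Bob, and Charlie.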
If you want to remove rows where all columns have identical values, you can use the drop_duplicates() method without specifying any subset.
```python
# Removing complete duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
```
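If you only want to inspect which rows are duplicates rather than drop them, the companion duplicated() method returns a boolean Series marking every row that repeats an earlier one. A minimal sketch with the same data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row
mask = df.duplicated()
print(df[mask])  # shows only the repeated rows
```

This is handy for counting or reviewing duplicates (e.g. mask.sum()) before deciding to remove them.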
By default, drop_duplicates() returns a new DataFrame with duplicates removed. If you want to modify the existing DataFrame in place, you can pass the inplace=True argument, though reassigning the result (df = df.drop_duplicates()) is generally preferred in modern pandas code.
```python
# Removing duplicates in place
df.drop_duplicates(inplace=True)
print(df)
```
By default, drop_duplicates() keeps the first occurrence of a duplicate row and removes the rest. If you want to keep the last occurrence instead, you can pass the keep='last' argument.
```python
# Keeping the last occurrence of duplicates
df_no_duplicates = df.drop_duplicates(keep='last')
print(df_no_duplicates)
```
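The keep parameter also accepts False, which drops every occurrence of a duplicated row rather than keeping one representative. A sketch with the example data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# keep=False removes ALL rows that have any duplicate
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only Charlie's row has no duplicate
```

This is useful when you want to keep only rows that were never repeated at all.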
These are the basic ways to remove duplicates in Pandas. Depending on your specific use case, you can customize the method with different arguments and options.