Removing duplicates in Pandas is a common operation, especially when dealing with datasets where duplicate rows may exist. You can remove duplicates based on one or more columns or consider the entire row for duplicates. Here's how you can do it:
You can use the drop_duplicates() method to remove duplicate rows based on specific columns.
```python
import pandas as pd

# Example DataFrame with duplicate rows
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Removing duplicates based on the 'Name' column
df_no_duplicates = df.drop_duplicates(subset=['Name'])
print(df_no_duplicates)
```
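The subset argument also accepts several column names, in which case rows count as duplicates only when they match on every listed column. A short sketch using the same example DataFrame:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Rows are duplicates only when BOTH 'Name' and 'Age' match
df_no_duplicates = df.drop_duplicates(subset=['Name', 'Age'])
print(df_no_duplicates)
```

Here rows 2 and 4 are dropped because they repeat earlier (Name, Age) pairs, leaving the first Alice, the first Bob, and Charlie.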
If you want to remove rows where all columns have identical values, you can use the drop_duplicates() method without specifying any subset.
```python
# Removing complete duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
```
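If you only want to inspect which rows are duplicates rather than drop them, the companion duplicated() method returns a boolean Series marking every row that repeats an earlier one. A minimal sketch with the same data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row
mask = df.duplicated()
print(df[mask])  # shows only the repeated rows
```

This is handy for counting or reviewing duplicates (e.g. mask.sum()) before deciding to remove them.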
By default, drop_duplicates() returns a new DataFrame with duplicates removed. If you want to modify the existing DataFrame in place, you can pass the inplace=True argument, though reassigning the result (df = df.drop_duplicates()) is generally preferred in modern pandas code.
```python
# Removing duplicates in place
df.drop_duplicates(inplace=True)
print(df)
```
By default, drop_duplicates() keeps the first occurrence of a duplicate row and removes the rest. If you want to keep the last occurrence instead, you can pass the keep='last' argument.
```python
# Keeping the last occurrence of duplicates
df_no_duplicates = df.drop_duplicates(keep='last')
print(df_no_duplicates)
```
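The keep parameter also accepts False, which drops every occurrence of a duplicated row rather than keeping one representative. A sketch with the example data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [25, 30, 25, 35, 30],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# keep=False removes ALL rows that have any duplicate
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only Charlie's row has no duplicate
```

This is useful when you want to keep only rows that were never repeated at all.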
These are the basic ways to remove duplicates in Pandas. Depending on your specific use case, you can customize the method with different arguments and options.