Data cleaning, filtering, and grouping are essential steps in data analysis. Pandas provides methods for each of them, which help you prepare data for further analysis. This article shows how to clean, filter, and group data with Pandas, with examples.
To start, import the Pandas library:
import pandas as pd
Data cleaning involves handling missing values, duplicates, and incorrect data types.
There are several methods to handle missing values:
    # Creating a DataFrame with missing values
    data = {
        "Name": ["Alice", "Bob", "Charlie", None],
        "Age": [25, None, 35, 40],
        "City": ["New York", "Los Angeles", "Chicago", "San Francisco"]
    }
    df = pd.DataFrame(data)

    # Dropping rows with missing values
    df_cleaned = df.dropna()
    print(df_cleaned)

    # Filling missing values with a specific value
    df_filled = df.fillna({"Name": "Unknown", "Age": 30})
    print(df_filled)

    # Filling missing values with the mean of the column
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    print(df)
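Before choosing between dropping and filling, it usually helps to see where the gaps are. The following is a minimal sketch that reuses the illustrative data from the example above; it counts missing values per column with isna() and drops rows only when a chosen column is missing. The variable names are assumptions made for illustration.

```python
import pandas as pd

# Same illustrative data as in the missing-values example
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, None, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "San Francisco"],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop a row only when 'Age' is missing; a missing 'Name' is kept
df_age_known = df.dropna(subset=["Age"])
print(df_age_known)
```

Restricting dropna() with subset avoids discarding rows whose only gap is in a column you do not need.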
You can identify and remove duplicate rows in a DataFrame:
    # Creating a DataFrame with duplicate rows
    data = {
        "Name": ["Alice", "Bob", "Alice", "Charlie"],
        "Age": [25, 30, 25, 35]
    }
    df = pd.DataFrame(data)

    # Checking for duplicate rows
    duplicates = df.duplicated()
    print(duplicates)

    # Removing duplicate rows
    df_no_duplicates = df.drop_duplicates()
    print(df_no_duplicates)
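By default, a row counts as a duplicate only when every column matches. As a sketch of a common variation, the example below treats rows as duplicates when just the Name column matches and keeps the last occurrence. The data is hypothetical and differs slightly from the example above so that the effect is visible.

```python
import pandas as pd

# Hypothetical data: 'Alice' appears twice with different ages
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 26, 35],
})

# Compare on 'Name' only and keep the last occurrence of each name
df_unique_names = df.drop_duplicates(subset=["Name"], keep="last")
print(df_unique_names)
```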
You can convert data types using the astype() method:
    # Converting a column to a specific data type
    df["Age"] = df["Age"].astype(int)
    print(df.dtypes)
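Note that astype(int) raises an error when the column contains missing or non-numeric values. One way to handle that case, sketched here with hypothetical data, is to parse the column with pd.to_numeric() and then convert it to the nullable "Int64" dtype, which can hold integers alongside missing values.

```python
import pandas as pd

# Hypothetical raw column with a non-numeric entry and a missing value
raw = pd.DataFrame({"Age": ["25", "30", "unknown", None]})

# errors="coerce" turns entries that cannot be parsed into NaN
ages = pd.to_numeric(raw["Age"], errors="coerce")

# The nullable Int64 dtype stores integers and keeps missing values as <NA>
raw["Age"] = ages.astype("Int64")
print(raw)
print(raw.dtypes)
```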
Filtering is the process of selecting rows from a DataFrame that meet specific criteria.
You can filter rows by applying conditions to columns:
    # Filtering rows where Age is greater than 30
    filtered_df = df[df["Age"] > 30]
    print(filtered_df)

    # Filtering rows where Name is 'Alice'
    filtered_df = df[df["Name"] == "Alice"]
    print(filtered_df)
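Conditions are not limited to comparison operators. As a short sketch using the same illustrative data, isin() matches a column against a list of values and between() selects a range that includes both endpoints by default.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35],
})

# Rows whose Name is one of several values
in_names = df[df["Name"].isin(["Alice", "Charlie"])]
print(in_names)

# Rows whose Age lies between 26 and 35, endpoints included
in_range = df[df["Age"].between(26, 35)]
print(in_range)
```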
You can combine multiple conditions using the & (and) and | (or) operators:
    # Filtering rows where Age is greater than 25 and Name is 'Alice'
    filtered_df = df[(df["Age"] > 25) & (df["Name"] == "Alice")]
    print(filtered_df)

    # Filtering rows where Age is greater than 25 or Name is 'Bob'
    filtered_df = df[(df["Age"] > 25) | (df["Name"] == "Bob")]
    print(filtered_df)
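Each condition needs its own parentheses because & and | bind more tightly than comparison operators in Python, and the keywords and/or do not work element-wise on a Series. A condition can also be negated with the ~ operator, as in this small sketch on the same illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35],
})

# Rows where Name is NOT 'Alice'
not_alice = df[~(df["Name"] == "Alice")]
print(not_alice)

# Rows where Age is greater than 25 and Name is not 'Bob'
older_not_bob = df[(df["Age"] > 25) & ~(df["Name"] == "Bob")]
print(older_not_bob)
```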
Grouping is a way to aggregate data based on certain criteria. Pandas provides the groupby() method to group data and perform aggregation operations such as sum, mean, or count on each group.
You can group data by a single column:
    # Grouping by 'Name' and calculating the mean Age for each group
    grouped = df.groupby("Name")["Age"].mean()
    print(grouped)
You can also group data by multiple columns:
    # Grouping by 'Name' and 'Age' and calculating the count of each group
    grouped = df.groupby(["Name", "Age"]).size()
    print(grouped)
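Grouping by several columns returns a Series with a multi-level index. If a flat table is easier to work with, one option is to turn the index levels back into columns with reset_index(). This is a sketch on the same illustrative data, and the column name "Count" is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35],
})

# Count rows per (Name, Age) pair, then flatten the result into a DataFrame
counts = df.groupby(["Name", "Age"]).size().reset_index(name="Count")
print(counts)
```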
You can perform multiple aggregation functions on grouped data:
    # Grouping by 'Name' and calculating both the mean and sum of Age
    grouped = df.groupby("Name")["Age"].agg(["mean", "sum"])
    print(grouped)
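When the result feeds into later steps, descriptive column names can be more convenient than "mean" and "sum". As a sketch, agg() also accepts keyword arguments, where each keyword becomes an output column name. The names mean_age and total_age are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35],
})

# Each keyword names an output column; its value is the aggregation to apply
summary = df.groupby("Name")["Age"].agg(mean_age="mean", total_age="sum")
print(summary)
```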
You can apply custom functions to grouped data using apply():
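    # Applying a custom function to the Age values of each group.
    # Selecting the "Age" column first keeps the grouping column out of the
    # function's input, which recent pandas versions warn about otherwise.
    def custom_function(ages):
        return ages.max() - ages.min()

    grouped = df.groupby("Name")["Age"].apply(custom_function)
    print(grouped)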
Data cleaning, filtering, and grouping are essential skills when working with data. Pandas provides powerful functions to clean data by handling missing values, duplicates, and incorrect data types. It also allows you to filter data based on specific conditions and group data for aggregation and analysis. By mastering these techniques, you can efficiently prepare and analyze data in Python.
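To show how these steps fit together, here is a minimal end-to-end sketch that chains cleaning, filtering, and grouping in one expression. The data, the age threshold of 26, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw data with a missing name, a missing age, and a duplicate row
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", None, "Charlie", "Bob"],
    "Age": [25, None, 25, 40, 35, 32],
    "City": ["New York", "Los Angeles", "New York",
             "Chicago", "Chicago", "Los Angeles"],
})

result = (
    raw.dropna(subset=["Name"])        # clean: drop rows without a name
       .drop_duplicates()              # clean: remove exact duplicate rows
       .assign(Age=lambda d: d["Age"].fillna(d["Age"].mean()))  # fill missing ages
       .loc[lambda d: d["Age"] > 26]   # filter: keep ages above 26
       .groupby("City")["Age"]         # group: one group per city
       .mean()                         # aggregate: mean age per city
)
print(result)
```

Chaining keeps each step visible in order, and the original raw DataFrame is left unchanged.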