Pandas is a powerful library in Python used for data analysis and manipulation. It provides a range of functions to perform basic operations on data such as summarizing, filtering, and aggregating. This article explores some of the basic data analysis operations you can perform using Pandas.
Before starting with the data analysis, you need to import the Pandas library:
import pandas as pd
To begin with, you need to load your data into a Pandas DataFrame. Data can be loaded from various sources like CSV, Excel, or a SQL database. Here is an example of loading data from a CSV file:
# Reading a CSV file into a DataFrame
df = pd.read_csv("data.csv")
print(df)
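If you do not have a data.csv file on hand, the following sketch may help you follow along. It assumes a small, made-up dataset with Name, Age, and City columns (hypothetical values, not taken from any real file) and reads it from an in-memory string instead of from disk:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of "data.csv".
csv_text = """Name,Age,City
Alice,34,London
Bob,28,Paris
Carol,,London
Dave,45,Berlin
"""

# read_csv accepts a file path or a file-like object, so StringIO works here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # Age is read as a float column because one value is missing
```

The empty Age field for Carol is read as a missing value (NaN), which is useful for trying out the missing-value methods discussed later.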
Once the data is loaded, you can explore it to understand its structure. Some common methods for exploring the data include:
The head() method is used to view the first few rows of the DataFrame:
# Displaying the first 5 rows of the DataFrame
print(df.head())
The info() method prints a summary of the DataFrame, such as the column names, non-null counts, and data types:
# Displaying the DataFrame info (info() prints directly and returns None)
df.info()
The describe() method provides summary statistics of numeric columns:
# Getting summary statistics of the DataFrame
print(df.describe())
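By default, describe() covers only the numeric columns. As a rough illustration, the sketch below uses a small made-up DataFrame (the column names and values are assumptions) to compare the default output with include="all", which also summarizes non-numeric columns:

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "Age": [30, 40, 50],
        "City": ["London", "Paris", "London"],
    }
)

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# include="all" also reports count, unique, top, and freq for text columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```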
In Pandas, you can select data based on column names, row indices, or specific conditions.
You can select one or more columns from a DataFrame:
# Selecting a single column
age_column = df["Age"]

# Selecting multiple columns
age_and_name = df[["Age", "Name"]]
You can select rows by their integer position using the iloc[] indexer:
# Selecting the first row
first_row = df.iloc[0]

# Selecting rows at positions 1 through 4 (the end of the slice is excluded)
rows_range = df.iloc[1:5]
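Position-based selection with iloc[] is easy to confuse with label-based selection, which pandas provides through the loc[] indexer. The sketch below uses a made-up DataFrame with string index labels (an assumption chosen so that positions and labels differ) to contrast the two:

```python
import pandas as pd

# Hypothetical data with string labels so positions and labels are distinct.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Age": [34, 28, 45]},
    index=["a", "b", "c"],
)

first_by_position = df.iloc[0]   # first row, whatever its label is
row_by_label = df.loc["b"]       # the row whose index label is "b"

position_slice = df.iloc[0:2]    # positions 0 and 1 (end excluded)
label_slice = df.loc["a":"b"]    # labels "a" through "b" (end included)

print(position_slice)
print(label_slice)
```

Note that a positional slice excludes its end point, while a label slice includes it.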
Filtering rows based on specific conditions can be done using boolean indexing:
# Selecting rows where Age is greater than 30
filtered_df = df[df["Age"] > 30]
print(filtered_df)
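Boolean indexing can also combine several conditions. The following sketch, which assumes a small made-up DataFrame with Age and City columns, shows one way to do this with the & (and) and | (or) operators and with isin(). Each condition needs its own parentheses because & and | bind more tightly than comparison operators:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol", "Dave"],
        "Age": [34, 28, 41, 45],
        "City": ["London", "Paris", "London", "Berlin"],
    }
)

# Both conditions must hold.
over_30_in_london = df[(df["Age"] > 30) & (df["City"] == "London")]

# At least one condition must hold.
young_or_berlin = df[(df["Age"] < 30) | (df["City"] == "Berlin")]

# Membership test against a list of allowed values.
in_selected_cities = df[df["City"].isin(["London", "Paris"])]

print(over_30_in_london)
print(young_or_berlin)
print(in_selected_cities)
```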
In real-world data, it’s common to have missing or null values. Pandas provides several ways to handle these missing values.
Use the isnull() method to check for missing values. It returns a DataFrame of the same shape containing True wherever a value is missing:
# Identifying missing values in the DataFrame
missing_values = df.isnull()
print(missing_values)
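On larger datasets, a full True/False table is hard to read, so it is common to summarize it. A minimal sketch, assuming a made-up DataFrame with a few gaps, counts the missing values per column and the number of rows that have at least one gap:

```python
import pandas as pd

# Hypothetical sample data with some missing entries.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", None],
        "Age": [34, None, None],
    }
)

# True counts as 1 when summed, so this gives missing values per column.
missing_per_column = df.isnull().sum()

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = int(df.isnull().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```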
You can remove rows with missing values using the dropna() method, which returns a new DataFrame and leaves the original unchanged:
# Dropping rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
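Dropping every row that has any gap can discard more data than intended. As a sketch under the assumption of a small made-up DataFrame, the subset and how parameters of dropna() give finer control:

```python
import pandas as pd

# Hypothetical sample data with gaps in different columns.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "Age": [34, None, 45],
        "City": ["London", "Paris", None],
    }
)

# Default: drop a row if any of its values is missing.
complete_rows = df.dropna()

# Only consider the Age column when deciding which rows to drop.
rows_with_age = df.dropna(subset=["Age"])

# Drop a row only if all of its values are missing.
not_fully_empty = df.dropna(how="all")

print(len(complete_rows), len(rows_with_age), len(not_fully_empty))
```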
To fill missing values, use the fillna() method. You can replace missing values with a specific value, such as the column mean:
# Filling missing values with the mean of the column
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
Pandas provides powerful functions for grouping and aggregating data.
You can group data based on one or more columns using the groupby() method:
# Grouping by 'City' and calculating the mean age in each city
grouped = df.groupby("City")["Age"].mean()
print(grouped)
You can perform multiple aggregation functions on grouped data:
# Grouping by 'City' and calculating both mean and sum of age
aggregated = df.groupby("City")["Age"].agg(["mean", "sum"])
print(aggregated)
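When aggregating more than one column, named aggregation can make the result easier to read. The sketch below assumes a small made-up DataFrame; the output column names (mean_age, total_age, people) are arbitrary labels chosen for this example:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "Age": [30, 28, 40],
        "City": ["London", "Paris", "London"],
    }
)

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby("City").agg(
    mean_age=("Age", "mean"),
    total_age=("Age", "sum"),
    people=("Name", "count"),
)

print(summary)
```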
Sorting data is a common operation in data analysis. You can sort data by one or more columns.
Use the sort_values() method to sort the DataFrame by a specific column:
# Sorting by the 'Age' column in ascending order
sorted_df = df.sort_values("Age")
print(sorted_df)
You can also sort by multiple columns by passing a list of column names:
# Sorting by 'City' (ascending) and 'Age' (descending)
sorted_df = df.sort_values(["City", "Age"], ascending=[True, False])
print(sorted_df)
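Two details of sorting are worth knowing: the sorted result keeps the original index labels, and missing values are placed at the end by default. A brief sketch, assuming a made-up DataFrame with one missing age, shows the ignore_index and na_position parameters of sort_values():

```python
import pandas as pd

# Hypothetical sample data with one missing age.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol", "Dave"],
        "Age": [34, None, 28, 45],
    }
)

# Descending order, missing values last, with a fresh 0..n-1 index.
oldest_first = df.sort_values(
    "Age", ascending=False, na_position="last", ignore_index=True
)

# Ascending order with missing values first; original index labels are kept.
missing_first = df.sort_values("Age", na_position="first")

print(oldest_first)
print(missing_first)
```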
Pandas provides a wide range of operations for basic data analysis, from loading and exploring data to cleaning, filtering, grouping, and sorting. These operations form the foundation for more advanced data manipulation and analysis. By mastering these basic operations, you can begin to uncover valuable insights from your data.