SQL provides built-in aggregate functions for statistical analysis of data stored in relational databases. These functions compute measures such as counts, sums, averages, and variances. In this article, we will explore some commonly used statistical functions in SQL with examples.
The COUNT() function counts rows. COUNT(*) counts every row the query returns, while COUNT(column) counts only the rows where that column is not NULL. It is particularly useful for counting the number of records in a table or the number of records that satisfy certain criteria.
To count the total number of employees in the employees table:
SELECT COUNT(*) AS total_employees FROM employees;
This query will return the total number of rows (employees) in the table.
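The difference between counting rows, counting non-NULL values, and counting rows that match a condition is easy to check on a tiny table. The following is a minimal sketch that runs the queries through Python's built-in sqlite3 module against an in-memory database. The sample rows and the department and manager_id columns are assumptions made up for illustration; they are not part of the employees table described above.

```python
import sqlite3

# Hypothetical sample data: department and manager_id are illustrative columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 1),
        ("Ben", "Sales", None),  # no manager recorded
        ("Cy", "IT", 1),
        ("Di", "IT", None),      # no manager recorded
    ],
)

# COUNT(*) counts rows; COUNT(manager_id) skips rows where manager_id is NULL.
total_rows, with_manager = conn.execute(
    "SELECT COUNT(*), COUNT(manager_id) FROM employees"
).fetchone()

# A WHERE clause restricts the rows before they are counted.
(sales_count,) = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE department = 'Sales'"
).fetchone()

print(total_rows, with_manager, sales_count)  # 4 2 2
```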
The SUM() function calculates the total of a numeric column, skipping NULL values. It is useful when you need to calculate the total sales, total expenses, or any other additive aggregate.
To calculate the total sales from the sales table:
SELECT SUM(sales_amount) AS total_sales FROM sales;
This query will return the sum of all the values in the sales_amount column.
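One detail worth knowing is that SUM() over zero rows returns NULL rather than 0, which matters once a WHERE clause is added. The sketch below shows this with Python's built-in sqlite3 module and an in-memory table. The region column and the sample amounts are hypothetical, and COALESCE is shown as one common way to fall back to 0.

```python
import sqlite3

# Hypothetical sample data: the region column is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 50.0)],
)

(total_sales,) = conn.execute("SELECT SUM(sales_amount) FROM sales").fetchone()

# No rows match 'west', so SUM has nothing to add and yields NULL (None in Python).
(west_raw,) = conn.execute(
    "SELECT SUM(sales_amount) FROM sales WHERE region = 'west'"
).fetchone()

# COALESCE substitutes 0 when the sum is NULL.
(west_total,) = conn.execute(
    "SELECT COALESCE(SUM(sales_amount), 0) FROM sales WHERE region = 'west'"
).fetchone()

print(total_sales, west_raw, west_total)  # 250.0 None 0
```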
The AVG() function computes the average (arithmetic mean) of a numeric column, ignoring NULL values. It is often used to determine the average salary, average score, or average price from a set of data.
To calculate the average salary of employees in the employees table:
SELECT AVG(salary) AS average_salary FROM employees;
This query will return the average salary of all employees in the table.
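Because AVG() skips NULL values, the result is the sum divided by the number of non-NULL salaries, not by the number of rows. The following sketch, using Python's built-in sqlite3 module and made-up sample rows, contrasts that default with treating a missing salary as 0 via COALESCE. Which behavior is appropriate depends on what a missing value means in your data.

```python
import sqlite3

# Hypothetical sample data; one employee has no salary recorded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 3000.0), ("Ben", 4000.0), ("Cy", None), ("Di", 5000.0)],
)

avg_known, avg_null_as_zero, n_rows, n_salaries = conn.execute(
    """
    SELECT AVG(salary),               -- averages the 3 non-NULL salaries
           AVG(COALESCE(salary, 0)),  -- counts the missing salary as 0
           COUNT(*),
           COUNT(salary)
    FROM employees
    """
).fetchone()

print(avg_known, avg_null_as_zero, n_rows, n_salaries)  # 4000.0 3000.0 4 3
```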
The MIN() and MAX() functions find the minimum and maximum values in a column, respectively. These functions are useful for determining the smallest and largest values in your dataset.
To find the lowest and highest salary in the employees table:
SELECT MIN(salary) AS lowest_salary, MAX(salary) AS highest_salary FROM employees;
This query will return the minimum and maximum salaries from the salary column.
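Variance and standard deviation are statistical measures that show the spread or dispersion of data. The VARIANCE() function calculates the variance, while the STDDEV() function calculates the standard deviation. Whether these names return the population or the sample statistic depends on the database: in MySQL they are the population versions, while in PostgreSQL they are aliases for the sample versions. The standard names VAR_POP(), VAR_SAMP(), STDDEV_POP(), and STDDEV_SAMP() make the choice explicit.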
To calculate the variance and standard deviation of the employees' salaries:
SELECT VARIANCE(salary) AS salary_variance, STDDEV(salary) AS salary_stddev FROM employees;
This query will return the variance and standard deviation of the salary data in the salary column.
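The population and sample versions of these statistics differ only in the divisor (n versus n - 1), and it helps to see both computed from basic aggregates. The sketch below is illustrative rather than a replacement for the built-in functions: it uses Python's built-in sqlite3 module, and since SQLite ships without VARIANCE() and STDDEV(), the formulas are spelled out with AVG(), SUM(), and COUNT(). The salary values are made up, and the square root is taken in Python.

```python
import math
import sqlite3
import statistics

salaries = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values, in thousands

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?)", [(s,) for s in salaries])

var_pop, var_samp = conn.execute(
    """
    SELECT AVG(salary * salary) - AVG(salary) * AVG(salary),          -- divisor n
           (SUM(salary * salary) - SUM(salary) * SUM(salary) / COUNT(*))
               / (COUNT(*) - 1)                                       -- divisor n - 1
    FROM employees
    """
).fetchone()

# Standard deviation is the square root of the variance.
stddev_pop = math.sqrt(var_pop)
stddev_samp = math.sqrt(var_samp)

print(var_pop, stddev_pop)  # 4.0 2.0
```

The sample variance comes out slightly larger (32/7) than the population variance (4.0), which is the kind of gap you would see when running the same VARIANCE() query on databases that define the name differently.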
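Most databases do not have a built-in MEDIAN() function. However, you can compute the median using window functions or custom queries, and some databases, such as PostgreSQL, support PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary). The median is the middle value in a dataset when arranged in ascending order, or the average of the two middle values when the number of values is even.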
To calculate the median salary, you can use a query like this:
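SELECT AVG(salary) AS median_salary FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary) AS row_num, COUNT(*) OVER () AS total_rows FROM employees) AS ranked_salaries WHERE 2 * row_num IN (total_rows, total_rows + 1, total_rows + 2);
This query orders the salaries, numbers the rows, keeps the middle row (or the two middle rows when the row count is even), and averages the result. Comparing 2 * row_num with the row count avoids division, which matters because (total_rows + 1) / 2 is truncated in some databases and fractional in others, where it would match no row for an even count.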
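A quick way to check a median query is to run it against tiny tables with an odd and an even number of rows. The sketch below does that with Python's built-in sqlite3 module (window functions require SQLite 3.25 or later). It uses one possible formulation that averages the two middle rows when the count is even; the median_salary helper and the sample values are hypothetical.

```python
import sqlite3
import statistics

# Keeps the middle row for an odd count, or both middle rows for an even count,
# using integer comparisons only, then averages what is left.
MEDIAN_SQL = """
SELECT AVG(salary) FROM (
    SELECT salary,
           ROW_NUMBER() OVER (ORDER BY salary) AS row_num,
           COUNT(*) OVER () AS total_rows
    FROM employees
) AS ranked_salaries
WHERE 2 * row_num IN (total_rows, total_rows + 1, total_rows + 2)
"""


def median_salary(salaries):
    """Load the salaries into a fresh in-memory table and run MEDIAN_SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?)", [(s,) for s in salaries])
    (median,) = conn.execute(MEDIAN_SQL).fetchone()
    conn.close()
    return median


odd_median = median_salary([30.0, 10.0, 50.0])         # middle of 10, 30, 50
even_median = median_salary([40.0, 10.0, 20.0, 30.0])  # mean of 20 and 30

print(odd_median, even_median)  # 30.0 25.0
```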
While not strictly a statistical function, GROUP_CONCAT() (MySQL, SQLite) or STRING_AGG() (PostgreSQL, SQL Server) is often used to aggregate values into a single string, which is useful for creating comma-separated lists or grouping values.
To create a list of all employee names in each department, you can use GROUP_CONCAT() in MySQL, as shown here, or STRING_AGG(employee_name, ', ') in PostgreSQL, where the separator argument is required:
SELECT department, GROUP_CONCAT(employee_name) AS employees FROM employees GROUP BY department;
This query will return a list of employee names, concatenated for each department.
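String aggregation can be tried without a database server, since SQLite also provides GROUP_CONCAT(). The sketch below uses Python's built-in sqlite3 module with made-up rows. It passes the separator as a second argument, which is SQLite's syntax (MySQL writes it as GROUP_CONCAT(employee_name SEPARATOR ', ') instead). SQLite does not guarantee the order of names within a group, so the names are sorted after splitting.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "Sales"), ("Ben", "Sales"), ("Cy", "IT")],
)

rows = conn.execute(
    """
    SELECT department, GROUP_CONCAT(employee_name, ', ') AS employees
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()

# Concatenation order within a group is not guaranteed, so sort before comparing.
names_by_department = {dept: sorted(names.split(", ")) for dept, names in rows}

print(names_by_department)  # {'IT': ['Cy'], 'Sales': ['Ana', 'Ben']}
```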
SQL provides a wide range of statistical functions that can be used to perform data analysis and gain insights from your data. These functions help to calculate totals, averages, counts, and other statistical measures. By combining these functions with filtering, grouping, and sorting, you can perform complex data analysis directly within your SQL queries.
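As an illustration of combining aggregates with filtering, grouping, and sorting, the sketch below builds a small per-department report using Python's built-in sqlite3 module. The active column, the sample rows, and the minimum headcount of 2 are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical sample data: the active flag is an illustrative column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_name TEXT, department TEXT, salary REAL, active INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Ana", "Sales", 3000.0, 1),
        ("Ben", "Sales", 5000.0, 1),
        ("Cy", "IT", 6000.0, 1),
        ("Di", "IT", 8000.0, 1),
        ("Ed", "IT", 1000.0, 0),   # inactive, removed by WHERE
        ("Flo", "HR", 4000.0, 1),  # only one person in HR, removed by HAVING
    ],
)

report = conn.execute(
    """
    SELECT department,
           COUNT(*)    AS headcount,
           AVG(salary) AS average_salary,
           MAX(salary) AS highest_salary
    FROM employees
    WHERE active = 1              -- filter rows before aggregating
    GROUP BY department           -- one result row per department
    HAVING COUNT(*) >= 2          -- filter groups after aggregating
    ORDER BY average_salary DESC  -- sort the aggregated rows
    """
).fetchall()

for row in report:
    print(row)
# ('IT', 2, 7000.0, 8000.0)
# ('Sales', 2, 4000.0, 5000.0)
```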