Descriptive statistics are fundamental tools in data analysis, providing simple summaries of a sample and its measures. They are essential for beginner data scientists who want to understand the basic properties of a data set. Let’s look at how to calculate five key descriptive statistics (mean, median, mode, range, and standard deviation) using basic command-line tools.

First, let’s review what we will be calculating, to ensure we are all on the same page:

- **Mean**: The average of a set of numbers
- **Median**: The middle value in a set of numbers
- **Mode**: The most frequently occurring value in a set of numbers
- **Range**: The difference between the highest and lowest values
- **Standard Deviation**: A measure of the amount of variation or dispersion in a set of values

## Setting Up the Environment

Aside from having a functioning Bash shell (or zsh shell on your macOS system), we will be relying on:

- **awk**: A versatile programming language primarily used for pattern scanning and processing.
- **bc**: An arbitrary precision calculator language for performing math operations.
- **sort**: A command-line utility used to sort lines of text files.

## Preparing the Data

Let’s start by creating a simple data file. This file will contain a set of numbers, one per line.

echo -e "5\n7\n7\n10\n12\n15" > data.txt
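The behavior of `echo -e` differs between shells (some `sh` implementations print the `-e` literally). As a hedged alternative, `printf` produces the same file and behaves consistently across POSIX shells. A minimal sketch:

```shell
# Write one value per line; printf reuses the format for each argument.
printf '%s\n' 5 7 7 10 12 15 > data.txt

# Quick sanity check: show the file contents.
cat data.txt
```

Either command leaves you with the same six-line `data.txt` used in the rest of this article.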

Let’s get into the calculations.

## Calculating Mean

The mean is the average of the numbers. We can calculate it by summing all the values and dividing by the count of values. We will use **awk** to do so.

awk '{ total += $1; count++ } END { print total/count }' data.txt

This command adds each value in **data.txt** to a running total and increments a counter for each line. At the end, it prints the total divided by the count.

Output:

9.33333
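The one-liner assumes every line holds a number and that the file is not empty. As a hedged sketch, the same idea can be wrapped in a small function that skips blank lines, avoids dividing by zero, and fixes the number of decimals. The function name `mean` and the `%.3f` format are illustrative choices, not part of the original command.

```shell
# mean: average of the first field, ignoring blank lines.
# The name and the three-decimal format are illustrative assumptions.
mean() {
  awk 'NF { total += $1; count++ }
       END { if (count > 0) printf "%.3f\n", total / count }' "$@"
}

# Reads from a file argument or, as here, from standard input.
printf '%s\n' 5 7 7 10 12 15 | mean   # prints 9.333
```

Using `printf` inside awk makes the rounding explicit instead of relying on awk's default output format.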

## Calculating Median

The median is the middle value in a sorted list of numbers. If the list has an even number of values, the median is the average of the two middle numbers. We will do so using **sort** and **awk**.

sort -n data.txt | awk '{ a[i++] = $1;} END { if (i % 2 == 0) { print (a[i/2] + a[i/2 - 1]) / 2; } else { print a[int(i/2)]; }}'

Output:

8.5

This script sorts the data and stores it in an array. Afterwards, it checks if the count of numbers is even or odd to determine how to calculate the median.
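To see both branches of that even/odd check in action, here is a minimal sketch that wraps the pipeline in a function and runs it on an even-length and an odd-length list. The function name `median` is a hypothetical helper, and the array is 0-based as in the command above.

```shell
# median: sort numerically, then pick (or average) the middle element(s).
# The function name is an illustrative assumption.
median() {
  sort -n "$@" | awk '{ a[n++] = $1 }
    END {
      if (n == 0) exit                       # nothing to report
      if (n % 2) print a[(n - 1) / 2]        # odd count: single middle value
      else print (a[n / 2 - 1] + a[n / 2]) / 2   # even count: mean of two
    }'
}

printf '%s\n' 5 7 7 10 12 15 | median   # even count, averages 7 and 10: 8.5
printf '%s\n' 12 5 7 | median           # odd count, middle of 5 7 12: 7
```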

## Calculating Mode

The mode is the value that appears most frequently in the data set. We will find this using **awk**.

awk '{a[$1]++} END {for (i in a) {if (a[i] > max) {max = a[i]; mode = i}}; print mode}' data.txt

Output:

7

This script counts the occurrences of each value and keeps track of the maximum count and corresponding mode.
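One caveat: when several values tie for the highest count, the command above reports only one of them, and which one depends on awk's unspecified `for (i in a)` iteration order. A hedged sketch that prints every tied value instead (the function name `modes` is hypothetical):

```shell
# modes: print every value whose count equals the highest count.
# The function name is an illustrative assumption.
modes() {
  awk '{ count[$1]++; if (count[$1] > max) max = count[$1] }
       END { for (v in count) if (count[v] == max) print v }' "$@" |
    sort -n   # make the output order deterministic
}

printf '%s\n' 5 7 7 10 12 15 | modes   # single mode: 7
printf '%s\n' 1 2 2 3 3 | modes        # two modes, one per line: 2 then 3
```

Note that if every value appears exactly once, this sketch lists all of them, since they all share the highest count.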

## Calculating Range

The range is the difference between the maximum and minimum values in the data set. We can do so using **sort** (with **head** and **tail** to pick out the extremes) and **bc**.

    min=$(sort -n data.txt | head -1)
    max=$(sort -n data.txt | tail -1)
    echo "$max - $min" | bc

Output:

10

These commands sort the data numerically, take the first line as the minimum and the last line as the maximum, then calculate the range using **bc**.
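Sorting the file twice is fine for small inputs. As an alternative sketch, awk can track the minimum and maximum in a single pass with no sorting at all. The function name `range` is hypothetical, and adding `0` to each value forces a numeric comparison.

```shell
# range: track min and max in one pass instead of sorting twice.
# The function name is an illustrative assumption.
range() {
  awk 'NR == 1 { min = max = $1 + 0 }
       { v = $1 + 0; if (v < min) min = v; if (v > max) max = v }
       END { if (NR > 0) print max - min }' "$@"
}

printf '%s\n' 5 7 7 10 12 15 | range   # 15 - 5 = 10
```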

## Calculating Standard Deviation

Standard deviation measures the dispersion of the values in the data set. First, we calculate the variance (the average of the squared differences from the mean), and then the standard deviation is the square root of the variance. We will accomplish this with **awk** alone, since it provides a built-in `sqrt` function.

awk '{ total += $1; count++; array[NR] = $1 } END { mean = total / count; for(i=1; i<=count; i++){ sumsq += (array[i] - mean)^2; } variance = sumsq / count; print sqrt(variance); }' data.txt

Output:

3.39935

This script calculates the mean, then sums the squared differences from the mean for each value, computes the variance, and finally calculates the standard deviation.
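Dividing the sum of squares by the count gives the population standard deviation. If your data is a sample drawn from a larger population, the usual estimate divides by the count minus one instead. A minimal sketch covering both cases follows; the function name `stddev` and its `population`/`sample` argument are hypothetical conveniences, not part of the original command.

```shell
# stddev: standard deviation of values read from standard input.
# Usage: stddev [population|sample]   (defaults to population)
# The function name and its argument are illustrative assumptions.
stddev() {
  awk -v mode="${1:-population}" '
    { x[NR] = $1; total += $1 }
    END {
      if (NR < 2) exit                       # need at least two values
      mean = total / NR
      for (i = 1; i <= NR; i++) sumsq += (x[i] - mean) ^ 2
      divisor = (mode == "sample") ? NR - 1 : NR
      printf "%.3f\n", sqrt(sumsq / divisor)
    }'
}

printf '%s\n' 5 7 7 10 12 15 | stddev          # population: 3.399
printf '%s\n' 5 7 7 10 12 15 | stddev sample   # sample (n - 1): 3.724
```

The sample figure is always slightly larger, and the gap shrinks as the number of values grows.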

## Putting It All Together

Here is an example script that calculates all the above statistics in one go:

    #!/bin/bash

    file="data.txt"

    # Mean
    mean=$(awk '{ total += $1; count++ } END { print total/count }' "$file")

    # Median
    median=$(sort -n "$file" | awk '{ a[i++] = $1;} END { if (i % 2 == 0) { print (a[i/2] + a[i/2 - 1]) / 2; } else { print a[int(i/2)]; }}')

    # Mode
    mode=$(awk '{a[$1]++} END {for (i in a) {if (a[i] > max) {max = a[i]; mode = i}}; print mode}' "$file")

    # Range
    min=$(sort -n "$file" | head -1)
    max=$(sort -n "$file" | tail -1)
    range=$(echo "$max - $min" | bc)

    # Standard Deviation
    stddev=$(awk '{ total += $1; count++; array[NR] = $1 } END { mean = total / count; for(i=1; i<=count; i++){ sumsq += (array[i] - mean)^2; } variance = sumsq / count; print sqrt(variance); }' "$file")

    echo "Mean: $mean"
    echo "Median: $median"
    echo "Mode: $mode"
    echo "Range: $range"
    echo "Standard Deviation: $stddev"

Output:

    Mean: 9.33333
    Median: 8.5
    Mode: 7
    Range: 10
    Standard Deviation: 3.39935

This script calculates the mean, median, mode, range, and standard deviation for the data in **data.txt** and prints the results. It integrates all the steps discussed, providing a comprehensive tool for descriptive statistics analysis.
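The script trusts its input: a stray word or header line in the file would silently be treated as zero by awk. As a hedged sketch, a small guard can be run before the calculations. The function name `check_numeric` and the accepted number format (optional minus sign, digits, optional decimal part) are assumptions made for illustration.

```shell
# check_numeric: succeed only if every non-blank line is a plain number.
# The name and the accepted number format are illustrative assumptions.
check_numeric() {
  awk 'NF && $1 !~ /^-?[0-9]+(\.[0-9]+)?$/ { bad = 1 }
       END { exit bad }' "$@"
}

printf '%s\n' 5 7.5 -3 | check_numeric && echo "ok"
printf '%s\n' 5 abc 7 | check_numeric || echo "non-numeric input"
```

In the script, a line such as `check_numeric "$file" || exit 1` near the top would stop early on bad data.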

## Final Thoughts

You may be asking yourself, why use Bash? Is it easier? Faster? **Better**? Isn't it easier to just use Python and its built-in features, or those of well-worn libraries such as NumPy?

While programming questions like this are highly subjective, keep in mind that, although this takes more code than a single call to a Python library, Bash code excerpts can be saved to a file and sourced into, or called directly from, your future Bash scripts. There may be a speed increase in using Bash over Python for some tasks, though the optimized code of a project such as NumPy likely outweighs this. Well-tested libraries also have the benefit of letting you feel reasonably certain that there are no bugs in the code and that your results are accurate.

In the end, do it to learn another language, try something different, get a fresh view on programming, and maybe learn something of value to keep handy in the future.

Check the following resources for additional information on writing Bash scripts:

- How to Automate Data Collection with Bash Scripts
- Tips for Writing Awesome Bash Scripts
- 5 Interesting Things to do with Bash Scripts

Happy bashing!