Jieun Kim

Analyzing Business Data in SQL (II)

Updated: Dec 14, 2022

Welcome back! In the first post, we investigated KPIs that measure various aspects of a company’s performance. However, KPIs don’t tell us much about the distribution of a company’s data. In this second post, we will explore unit economics, histograms, and percentiles; the last two are especially important to understand distributions.


Steps of this analysis

The project consists of the following sections:

  1. Unit economics and distributions

    • Unit economics

    • Histogram

    • Bucketing

    • Percentiles

  2. Generating an executive report

    • Survey of useful functions

    • Pivoting

    • Producing executive reports

  3. Summary

Each section includes a case study with business analytics questions to build a deeper understanding of Delivr's business model.



1. Unit economics and distributions

1-A. Unit economics

Let’s start with unit economics. The KPIs in the first post measure a company’s overall performance. Revenue, for example, is how much money a company generates overall.

In contrast, unit economics measures performance per unit (or user).

Rather than revenue, we would calculate the average revenue per user, or ARPU.

  • ARPU = overall revenue / count of users.

  • This tells us how much revenue the company generates from each of its users on average.

  • ARPU is useful since it measures a company’s success at scaling its business model.

Even if a company’s overall revenue is increasing, if its ARPU is decreasing, the company is less effective at generating revenue from each of its users. More users are paying, but each of them is paying less. Unit economics isn’t restricted to ARPU. In the following exercises, we’ll write queries to calculate ARPU and other per-unit KPIs, such as the average orders per user.


1-1. Average Revenue Per User (ARPU)

Case 1 Dave from Finance wants to study Delivr's revenue and orders per user. In other words, he wants to understand its unit economics. Help Dave kick off his study by calculating the overall average revenue per user (ARPU).

  • Return a table of user IDs and the revenue each user generated.

  • Wrap the previous query in a CTE named kpi.

  • Return the average revenue per user (ARPU).
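
Here's a minimal sketch of one way to write this query. The table and column names (a meals table with meal_id and meal_price, and an orders table with order_id, user_id, meal_id, and order_quantity) are assumptions carried over from the Delivr schema used in the first post.

-- Sketch only; table and column names assume the Delivr schema from the first post
WITH kpi AS (
    SELECT
        o.user_id,
        SUM(m.meal_price * o.order_quantity) AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY o.user_id)

SELECT ROUND(AVG(revenue) :: NUMERIC, 2) AS arpu
FROM kpi;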


🌠Result

ARPU indicates how well a company generates revenue from each of its users as it scales, and an ARPU of $199.56 is an excellent figure.






1-2. ARPU per month

Case 2 Next, Dave wants to see whether ARPU has increased over time. Even if Delivr's revenue is increasing, it's not scaling well if its ARPU is decreasing—it's generating less revenue from each of its customers. Send Dave a table of ARPU by month.

  • Store revenue and the number of unique active users by month in the kpi CTE.

  • Calculate ARPU by dividing the revenue by the number of users.

  • Order the results by month in ascending order.
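
A sketch of one possible query, under the same schema assumptions as in case 1. DATE_TRUNC('month', ...) groups orders into months, and GREATEST() guards against dividing by zero.

-- Sketch only; schema assumptions as in case 1
WITH kpi AS (
    SELECT
        DATE_TRUNC('month', o.order_date) :: DATE AS delivr_month,
        SUM(m.meal_price * o.order_quantity) AS revenue,
        COUNT(DISTINCT o.user_id) AS users
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY delivr_month)

SELECT
    delivr_month,
    ROUND(revenue :: NUMERIC / GREATEST(users, 1), 2) AS arpu
FROM kpi
ORDER BY delivr_month ASC;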



🌠Result

Great! ARPU by month allows you to track how well your company is scaling over time. From July onward, there's a clear upward trend in ARPU month over month.












1-3. Average orders per user

Case 3 Dave wants to add the average orders per user value to his unit economics study, since more orders usually correspond to more revenue. Calculate the average orders per user for Dave.


Note: The count of distinct orders is different from the sum of ordered meals. One order can contain many meals. Average orders per user depends on the count of orders, not the sum of ordered meals.

  • Store the count of distinct orders and distinct users in the kpi CTE.

  • Calculate the average orders per user.

  • Remember to use GREATEST() to avoid dividing by 0.
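
A sketch of one possible query; only the assumed orders table is needed here, since revenue isn't involved.

-- Sketch only; assumes the orders table described in case 1
WITH kpi AS (
    SELECT
        COUNT(DISTINCT order_id) AS orders,
        COUNT(DISTINCT user_id) AS users
    FROM orders)

SELECT ROUND(orders :: NUMERIC / GREATEST(users, 1), 2) AS avg_orders_per_user
FROM kpi;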


🌠Result

Like ARPU, the average orders per user KPI is a way to study Delivr's unit economics. Almost 9 orders per user is a high number for only six months of operation!






1-B. Histogram

Now that we have an idea of unit economics, let’s explore distributions. Understanding your dataset's distribution helps you understand which values are frequent and which are rare. The best way to visualize and understand a dataset’s distribution is to plot its histogram.


A histogram visualizes the frequency of each value in a dataset. For example, in the histogram of orders per user, 7 users have ordered once, 42 users have ordered twice, and so on. To plot a histogram, we need a frequency table; the table below is the data source of such a plot. Histograms are valuable visualizations for understanding how a dataset's values are distributed. In the following exercises, we write queries that return frequency tables, from which histograms can be plotted.

Number of total orders    Number of users
1                         7
2                         42
3                         65
...                       ...


1-4. Histogram of Revenue

Case 4 After determining that Delivr is doing well at scaling its business model, Dave wants to explore the distribution of revenues. He wants to check whether the distribution is U-shaped or normal to decide how best to categorize users by the revenue they generate. Send Dave a frequency table of revenues by user.

  • Store each user ID and the revenue Delivr generates from it in the user_revenues CTE.

  • Return a frequency table of revenues rounded to the nearest hundred and the users generating those revenues.
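
A sketch of one possible query, under the same schema assumptions as before. ROUND(revenue, -2) rounds each user's revenue to the nearest hundred so that users cluster into readable bins.

-- Sketch only; schema assumptions as in case 1
WITH user_revenues AS (
    SELECT
        o.user_id,
        SUM(m.meal_price * o.order_quantity) AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY o.user_id)

SELECT
    ROUND(revenue :: NUMERIC, -2) AS revenue_100,
    COUNT(DISTINCT user_id) AS users
FROM user_revenues
GROUP BY revenue_100
ORDER BY revenue_100 ASC;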


🌠Result

A frequency table allows us to plot a dataset's histogram. Notice that revenue is rounded using a negative number of decimal places. When a negative value is passed to ROUND as the second argument, it rounds to the nearest ten raised to the power of the absolute value of that argument; for example, passing -2 rounds to the nearest hundred. Revenues are usually decimal values, so it's unlikely that two or more users generated exactly the same revenue; without rounding, the histogram would be cluttered with single-user bars. That's why revenue is rounded here.






1-5. Histogram of orders

Case 5 Dave also wants to plot the histogram of orders to see if it matches the shape of the histogram of revenues. Send Dave a frequency table of orders by user.

  • Set up the frequency table query by getting each user's count of orders.

  • Return a frequency table of orders and the count of users with those orders.
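
A sketch of one possible query; it only needs the assumed orders table.

-- Sketch only; assumes the orders table described in case 1
WITH user_orders AS (
    SELECT
        user_id,
        COUNT(DISTINCT order_id) AS orders
    FROM orders
    GROUP BY user_id)

SELECT
    orders,
    COUNT(DISTINCT user_id) AS users
FROM user_orders
GROUP BY orders
ORDER BY orders ASC;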


🌠Result

Great! Now we have the full frequency table from which the histogram of orders per user can be plotted.












1-C. Bucketing with the CASE statement

Histograms are great at visualizing the distribution of values in a dataset. However, they aren't customizable: if we're plotting a histogram of orders by user, we can't split users into groups of our own choosing and check the frequencies of those groups. To customize the groups, we need the CASE statement, which is how SQL implements conditional logic.


This query separates meals into three price categories, or "buckets": low, medium, and high-priced meals. If a meal's price is less than $4, it's a low-priced meal; if it's less than $6, it's a medium-priced meal; otherwise, it's a high-priced meal. For each conditional statement, start the condition with WHEN, and start the resulting value of that condition with THEN. If there's a default value, precede it with ELSE. Grouping by the price category column and counting the meal IDs returns a table of each category and how many meals are in it. Using CASE like this is an effective way to split your data into categories.

SELECT 
CASE 
    WHEN meal_price < 4 THEN 'Low-price meal'
    WHEN meal_price < 6 THEN 'Mid-price meal'
    ELSE 'High-price meal' 
END AS price_category,
COUNT(DISTINCT meal_id)
FROM meals 
GROUP BY price_category;

As opposed to a histogram, we can visualize the result using a bar graph. Each bar represents a category, and its height is the number of users in that category. The advantage of bar graphs over histograms is that the data is summarized. Histograms show the full spectrum of values found in the data, whereas bar graphs show a summarized version. This is useful especially when presenting to management, who often want a quick and easy way to understand the data's distribution. Bucketing summarizes the frequency tables and presents a dataset's distribution in a cleaner way. In the following exercises, we'll write queries that split users into categories.



1-6. Bucketing users by revenue

Case 6 Based on his analysis, Dave identified that $150 is a good cut-off for low-revenue users, and $300 is a good cut-off for mid-revenue users. He wants to find the number of users in each category to tweak Delivr's business model. Split the users into low, mid, and high-revenue buckets, and return the count of users in each group.

  • Store each user ID and the revenue it generates in the user_revenues CTE.

  • Return a table of the revenue groups and the count of users in each group.
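
A sketch of one possible query, under the same schema assumptions as before; the $150 and $300 cut-offs come straight from the case description.

-- Sketch only; schema assumptions as in case 1
WITH user_revenues AS (
    SELECT
        o.user_id,
        SUM(m.meal_price * o.order_quantity) AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY o.user_id)

SELECT
    CASE
        WHEN revenue < 150 THEN 'Low-revenue users'
        WHEN revenue < 300 THEN 'Mid-revenue users'
        ELSE 'High-revenue users'
    END AS revenue_group,
    COUNT(DISTINCT user_id) AS users
FROM user_revenues
GROUP BY revenue_group;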


🌠Result

The table shows that around 50% of Delivr's customers are mid-revenue users. Categorizing users by revenue will be useful when presenting to management, who often want a quick and easy way to understand the data's distribution.





1-7. Bucketing users by orders

Case 7 Dave is repeating his bucketing analysis on orders to have a more complete profile of each group. He determined that 8 orders is a good cut-off for the low-orders group, and 15 is a good cut-off for the medium orders group. Send Dave a table of each order group and how many users are in it.

  • Store each user ID and its count of orders in a CTE named user_orders.

  • Set the cut-off point for the low-orders bucket to 8 orders, and set the cut-off point for the mid-orders bucket to 15 orders.

  • Count the distinct users in each bucket.
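
A sketch of one possible query; the cut-offs of 8 and 15 orders come from the case description.

-- Sketch only; assumes the orders table described in case 1
WITH user_orders AS (
    SELECT
        user_id,
        COUNT(DISTINCT order_id) AS orders
    FROM orders
    GROUP BY user_id)

SELECT
    CASE
        WHEN orders < 8 THEN 'Low-orders users'
        WHEN orders < 15 THEN 'Mid-orders users'
        ELSE 'High-orders users'
    END AS order_group,
    COUNT(DISTINCT user_id) AS users
FROM user_orders
GROUP BY order_group;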



🌠Result

Around 50% of Delivr's customers are mid-orders users, who placed at least 8 but fewer than 15 orders in total.







1-D. Percentiles

The last method to understand a dataset's distribution is the percentile. Percentiles help us understand what percentage of a dataset is beneath a certain value, and what percentage is at or above it.

  • The Nth percentile is the value for which N% of your data is beneath this value.

  • The lowest or minimum value in a dataset is the 0th percentile; 0% of your data's values are beneath this value.

  • The highest or maximum value is the 100th percentile; all of your data's values are at or below this value.

  • Quartiles split the data into four: the first quartile is the 25th percentile, the second quartile (the median) is the 50th percentile, and the third quartile is the 75th percentile.


  • Positive skew: If your dataset's median is less than the average (or mean), your data is skewed positively. It means that some high values are pushing the average up, whereas most values are beneath that average.

  • Negative skew: If your dataset's median is more than the average, your data is skewed negatively. It means that some low values are pushing the average down, whereas most values are above that average.

Percentiles and quartiles are quick ways to determine values by which you can separate your dataset into groups. They're also used to determine the skew of your data. Let's write queries to calculate percentiles in the following exercises.


1-8. Revenue quartiles

Case 8 Dave is wrapping up his study, and wants to calculate a few more figures. He wants to find out the first, second, and third revenue quartiles. He also wants to find the average to see in which direction the data is skewed. Calculate the first, second, and third revenue quartiles, as well as the average.

  • Store each user ID and the revenue Delivr generates from it in the user_revenues CTE.

  • Calculate the first, second, and third revenue quartile.

  • Calculate the average revenue.

  • We can calculate the 30th percentile for a column named column_a by using PERCENTILE_CONT(0.30) WITHIN GROUP (ORDER BY column_a ASC).
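
A sketch of one possible query, under the same schema assumptions as before. PERCENTILE_CONT is a PostgreSQL ordered-set aggregate, so it needs the WITHIN GROUP (ORDER BY ...) clause.

-- Sketch only; schema assumptions as in case 1
WITH user_revenues AS (
    SELECT
        o.user_id,
        SUM(m.meal_price * o.order_quantity) AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY o.user_id)

SELECT
    ROUND((PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue ASC)) :: NUMERIC, 2) AS revenue_p25,
    ROUND((PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY revenue ASC)) :: NUMERIC, 2) AS revenue_p50,
    ROUND((PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue ASC)) :: NUMERIC, 2) AS revenue_p75,
    ROUND(AVG(revenue) :: NUMERIC, 2) AS avg_revenue
FROM user_revenues;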


🌠Result

Quartiles are useful to pinpoint cutoff values in the dataset.




1-9. Interquartile range

Case 9 The final value that Dave wants is the count of users in the revenue interquartile range (IQR). Users outside the revenue IQR are outliers, and Dave wants to know the number of typical users. Return the count of users in the revenue IQR.

  • Return a table of user IDs and generated revenues for each user.

  • Wrap the previous query in a CTE named user_revenues.

  • Calculate the first and third revenue quartiles.

  • Count the number of distinct users.

  • Filter out all users outside the IQR.
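
A sketch of one possible query, under the same schema assumptions as before. The quartiles are computed once in a second CTE and cross-joined onto every user row so they can be used in the WHERE clause.

-- Sketch only; schema assumptions as in case 1
WITH user_revenues AS (
    SELECT
        o.user_id,
        SUM(m.meal_price * o.order_quantity) AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    GROUP BY o.user_id),

quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue ASC) AS revenue_p25,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue ASC) AS revenue_p75
    FROM user_revenues)

SELECT COUNT(DISTINCT user_id) AS users
FROM user_revenues
CROSS JOIN quartiles
WHERE revenue BETWEEN revenue_p25 AND revenue_p75;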


🌠Result

Great! We removed the users at each end of the revenue spectrum using the IQR.






2. Generating an executive report

2-A. Survey of useful functions

In this final chapter, we will package the KPIs into a readable report in order to present to managers and executives. Dates are everywhere in business data. Throughout this project, we have truncated dates using DATE_TRUNC, and we cast the output to DATE to drop the hours, minutes, and so on.

DATE_TRUNC('quarter', '2018-08-13' :: DATE) :: DATE
Result: '2018-07-01'

However, human-readable dates are important in reporting, and the default date format isn't very readable. How do you get from the default date format to something more readable?

TO_CHAR is a function that formats dates based on the format patterns passed to the function.


Let's explore how TO_CHAR works. TO_CHAR takes two inputs, a date and a format string. Its output is a string of the date formatted according to the patterns in the format string. For example, the pattern Dy represents the abbreviated day name. Take the first day of June 2018. June 1 is a Friday. Passing the first of June 2018 with a format string of Dy returns Fri, the abbreviation for Friday.

TO_CHAR('2018-06-01' :: DATE, 'Dy-DD')
Result: 'Fri-01'

Patterns in the format string will automatically be replaced with what they represent, extracted from the date. All other characters will remain as-is in the output. DD represents the date's day number. Dash is not a pattern. Adding dash and DD to the format string adds a dash and the day's number to the output.


Now, let's recall window functions. You calculated the registrations running total with SUM, and you used LAG to fetch the previous month's active users, or MAU, to calculate the MAU growth rate. A third useful window function is RANK, which assigns a rank to each row based on that row's position in a sorted partition. For example, ranking by revenue ranks users, eateries, or months by how much revenue they generated.

  • SUM(...) OVER (...) : Calculates a column's running total.

Example: SUM(registrations) OVER (ORDER BY registration_month)
-- calculates the registrations running total
  • LAG(...) OVER (...) : Fetches a preceding row's value.

Example: LAG(mau) OVER (ORDER BY active_month)
-- returns the previous month's active users (MAU)
  • RANK() OVER (...) : Assigns a rank to each row based on that row's position in a sorted order.

Example: RANK() OVER (ORDER BY revenue DESC)
-- ranks users, eateries, or months by the revenue they have generated

We will use these functions in the following cases to write queries for executive reports.


2-1. Formatting dates

Case 10 Eve from the Business Intelligence (BI) team lets us know that she needs help writing queries for reports. The reports are read by C-level execs, so they need to be as readable and quick to scan as possible. Eve tells us that the C-level execs' preferred date format is something like Friday 01, June 2018 for 2018-06-01. We have a list of useful patterns. Figure out the format string that formats 2018-06-01 as Friday 01, June 2018 when using TO_CHAR.

  • Select the order date.

  • Format the order date so that 2018-06-01 is formatted as Friday 01, June 2018.
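
A sketch of one possible query, assuming the orders table from before. The FM prefix suppresses the blank padding that the Day and Month patterns add by default, so 2018-06-01 comes out as Friday 01, June 2018.

-- Sketch only; assumes the orders table described in case 1
SELECT DISTINCT
    order_date,
    TO_CHAR(order_date, 'FMDay DD, FMMonth YYYY') AS format_order_date
FROM orders
ORDER BY order_date ASC
LIMIT 3;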


🌠Result

Formatted dates are much easier for execs to scan. Our query returned each order date formatted like Friday 01, June 2018 instead of 2018-06-01.






2-2. Rank users by their count of orders

Case 11 Eve tells us that she wants to report which user IDs have the most orders each month. She doesn't want to display long numbers, which will only distract C-level execs, so she wants to display only their ranks. Send Eve a list of the top 3 user IDs by orders in August 2018 with their ranks.

  • Keep only the orders in August 2018.

  • Wrap the previous query in a CTE named user_count_orders.

  • Select the user ID and rank all user IDs by the count of orders in descending order.

  • Only keep the top 3 users by their count of orders.
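
A sketch of one possible query, assuming the orders table from before.

-- Sketch only; assumes the orders table described in case 1
WITH user_count_orders AS (
    SELECT
        user_id,
        COUNT(DISTINCT order_id) AS count_orders
    FROM orders
    WHERE DATE_TRUNC('month', order_date) :: DATE = '2018-08-01'
    GROUP BY user_id)

SELECT
    user_id,
    RANK() OVER (ORDER BY count_orders DESC) AS count_orders_rank
FROM user_count_orders
ORDER BY count_orders_rank ASC
LIMIT 3;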


🌠Result

Ranking is a good way to reveal order while concealing raw values. Rank 1 goes to the user with the most orders, rank 2 to the user with the second-most orders, and so on.






2-B. Pivoting

When producing reports, we often want to reshape a table to create visualizations of its data. Sometimes, that requires rotating rows into columns. How is that done in SQL? With pivoting.

  • What is pivoting?

1) To pivot a table is to rotate that table around a pivot column. In other words, it's transposing a row into a column.

2) Pivoting converts a long table with many rows and few columns into a wide table with fewer rows and more columns.

  • Why pivot?

1) Pivoting allows you to change a table's shape at will while preserving its data.

2) Unstacked data in a wide table is easier to read than stacked data in a long table.

For example, take a table of the count of orders for the meals with IDs 0 and 1 in Delivr's first two months of operation, with one row per meal per month. Pivoting it by the delivr_month column turns the distinct values in delivr_month, June and July 2018, into columns, replacing delivr_month. The table is automatically reshaped while preserving the relations of the values (in this case, the count of orders) in each row.


How do we perform this pivot in SQL? Enter CROSSTAB. CROSSTAB takes a source query and pivots its output by one of its columns. To use CROSSTAB, first enable it with this CREATE EXTENSION statement. tablefunc is a collection of SQL functions that aren't available by default in PostgreSQL. In this sense, CREATE EXTENSION is like import in Python.

CREATE EXTENSION IF NOT EXISTS tablefunc;

Pass the source query whose output you want to pivot as a string. Place two dollar signs after the opening parenthesis and before the closing one: CROSSTAB pivots the table provided to it as a string, and enclosing a query in two dollar signs is a safe way to convert it to a string. Then, select everything from the output of CROSSTAB, assigning it an alias (in this case, ct). Finally, in parentheses, list the pivoted table's columns and their data types.

SELECT * FROM CROSSTAB($$
    source_SQL
$$)
AS ct (column_1 DATA_TYPE_1,
       column_2 DATA_TYPE_2,
       ...,
       column_n DATA_TYPE_N);


2-3. Pivoting user revenues by month

Case 12 Eve tells you that the C-level execs prefer wide tables over long ones because they're easier to scan. She prepared a sample report of user revenues by month, detailing the first 5 user IDs' revenues from June to August 2018. The execs told her to pivot the table by month. She's passed that task off to you.

Pivot the user revenues by month query so that each user ID is a row and each month from June to August 2018 is a column.

  • Enable CROSSTAB() from tablefunc.

  • Declare the new pivot table's columns, user ID and the first three months of operation.
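
A sketch of one possible query, under the same schema assumptions as before. Treating the "first 5 user IDs" as IDs 0 through 4 is an assumption, as is the June to August 2018 window; revenue is cast to FLOAT so the source and declared column types match.

-- Sketch only; user IDs 0-4 and the month columns are assumptions based on the case description
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM CROSSTAB($$
    SELECT
        o.user_id,
        DATE_TRUNC('month', o.order_date) :: DATE AS delivr_month,
        SUM(m.meal_price * o.order_quantity) :: FLOAT AS revenue
    FROM meals AS m
    JOIN orders AS o ON m.meal_id = o.meal_id
    WHERE o.user_id IN (0, 1, 2, 3, 4)
      AND o.order_date < '2018-09-01'
    GROUP BY o.user_id, delivr_month
    ORDER BY o.user_id, delivr_month
$$)
AS ct (user_id INT,
       "2018-06-01" FLOAT,
       "2018-07-01" FLOAT,
       "2018-08-01" FLOAT)
ORDER BY user_id ASC;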


🌠Result

Pivoting is a core tool to reshape tables. When pivoting a table creates an empty cell with no data, its value is null. We can fix nulls with COALESCE().








2-4. Pivot total cost by eatery by month

Case 13 The C-level execs next tell Eve that they want a report on the total costs by eatery in the last two months. First, write a query to get the total costs by eatery in November and December 2018, then pivot by month.

  • Select the eatery and calculate total cost per eatery.

  • Keep only the records after October 2018.

  • Enable CROSSTAB from tablefunc.

  • Declare the new pivot table's columns, the eatery and the last two months of operation.
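
A sketch of one possible query. The stock table (stocking_date, meal_id, stocked_quantity) and the meal_cost column are assumptions carried over from the cost KPI in the first post.

-- Sketch only; the stock table and meal_cost column are assumed from the first post
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM CROSSTAB($$
    SELECT
        m.eatery,
        DATE_TRUNC('month', s.stocking_date) :: DATE AS delivr_month,
        SUM(m.meal_cost * s.stocked_quantity) :: FLOAT AS cost
    FROM meals AS m
    JOIN stock AS s ON m.meal_id = s.meal_id
    WHERE s.stocking_date >= '2018-11-01'
    GROUP BY m.eatery, delivr_month
    ORDER BY m.eatery, delivr_month
$$)
AS ct (eatery TEXT,
       "2018-11-01" FLOAT,
       "2018-12-01" FLOAT)
ORDER BY eatery ASC;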

🌠Result

In the column definition list, the two months are enclosed in double quotation marks because they're now column names, not values. Total cost is also cast to FLOAT in the source query to make sure that the types of the unpivoted and pivoted values are the same.



2-C. Executive report

Reports are often read by managers and C-level executives, so readability is crucial in producing good reports. Here are some general readability guidelines when producing a report.

  • Dates: Use readable date formats (August 2018, not 2018-08-01) with TO_CHAR

  • Numbers: Round numbers to the second decimal at most (98.76, not 98.76234)

  • Table shape: Reshape long tables into wide ones, pivoting by date when possible

  • Order: Don't forget to sort! Make sure to pick which columns to sort by sensibly.


2-5. Executive report: Order rank by eatery and quarter

Case 14 Eve wants to produce a final executive report about the rankings of eateries by the number of unique users who order from them, by quarter.

Send Eve a pivoted table that ranks each eatery by its count of unique ordering users in each quarter.

  • Fill in the format string that formats 2018-06-01 as Q2 2018.

  • Count the ordering users by eatery and by quarter.

  • Select the eatery and the quarter from the CTE.

  • Assign a rank to each row, with the top-most rank going to the row with the highest count of ordering users.

  • Import the tablefunc extension.

  • Pivot the table by quarter.

  • Select the new columns from the pivoted table.
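
A sketch of one possible query, under the same schema assumptions as before. The quarter labels Q2 to Q4 2018 assume Delivr's data runs from June to December 2018; the source query inside CROSSTAB is ordered by eatery and quarter so each row group lists its quarters in the same order.

-- Sketch only; quarter labels assume data from June to December 2018
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM CROSSTAB($$
    WITH eatery_users AS (
        SELECT
            m.eatery,
            TO_CHAR(o.order_date, '"Q"Q YYYY') AS delivr_quarter,
            COUNT(DISTINCT o.user_id) AS users
        FROM meals AS m
        JOIN orders AS o ON m.meal_id = o.meal_id
        GROUP BY m.eatery, delivr_quarter)

    SELECT
        eatery,
        delivr_quarter,
        RANK() OVER (PARTITION BY delivr_quarter ORDER BY users DESC) :: INT AS users_rank
    FROM eatery_users
    ORDER BY eatery, delivr_quarter
$$)
AS ct (eatery TEXT,
       "Q2 2018" INT,
       "Q3 2018" INT,
       "Q4 2018" INT)
ORDER BY eatery ASC;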


🌠Result

This is the result: a ranking of eateries by the count of unique users who have ordered from them, by quarter. The lower the rank, the higher the count -- rank 1 goes to the eatery with the most ordering users, and so on. It's clean and easy to interpret: the Moon Wok is dominant; Burgatorio is lagging behind, while the others vie for the middle ranks. It could even be passed to a visualization tool to produce a bump chart.



3. Summary

We've learned quite a bit, so let's go over the analysis to consolidate it. In Chapter 1, we learned about revenue, cost, and profit, and how to combine the revenue and cost queries with CTEs to calculate profit. In Chapter 2, we covered user-centric KPIs, namely registrations and active users, and used window functions to calculate the active-user growth and retention rates. In Chapter 3, we explored unit economics and learned to understand how data is distributed through histograms and percentiles. In Chapter 4, we learned how to produce an executive report using pivoting.


Now you can use SQL to analyze real-world business data. I hope you can make informed decisions by analyzing data!



 

Source: This project is based on the course "Analyzing Business Data in SQL" from Datacamp.

