Learning SQL with Dunder Mifflin

Learning SQL with Dunder Mifflin

Recently I worked on a project where I had to step up my game in SQL. I decided to publish some queries with the context of why I would need them in the hope of helping someone who’s entering this wild SQL world. I feel the context is essential here, some of these queries I’ve seen before in SQL tutorials, but I couldn’t really understand in what context I could use them.

I was looking for a place where I could find a database with a table I could easily use those queries and I found the best table in the world, dunder_mifflin_paper_sales😂. If you don’t know what I’m talking about, do yourself a favor and Wikipedia it!

This dataset is available at Mode, you can sign up for free and find it at Mode Community (where all free datasets live). Just add the queries below to see the same results, or even create other queries with the same dataset!

TABLE SIZE

One query I had never seen before was to get a table size. This is only important for very heavy tables so you can maybe get an estimate of how long it would take to upload it. Pg_total_relation_size includes the total size of a table, including indices and additional objects.

SELECT pg_size_pretty(
   pg_total_relation_size('tutorial.dunder_mifflin_paper_sales'));

SELECT EVERYTHING

One good way of taking a first peek at what a table has is listing every single column and table from it. In cases of very big tables, adding a limit is recommended, since it may take a while to show everything. For dunder_mifflin_paper_sales case, I used a limit of 5.

SELECT * FROM tutorial.dunder_mifflin_paper_sales LIMIT 5;

SELECT WHEN NULL

One check I did was to verify if there were null rows. For dunder_mifflin_paper_sales example, customers added nice reviews about our favorite paper company. Let’s say someone wanted to list all reviews, but some companies didn’t write any review, so if I had a query like

SELECT account_name, review 
FROM tutorial.dunder_mifflin_paper_sales;

There would definitely be some empty fields. For that not to happen, and we list only the ones who wrote a review, we run the query

SELECT account_name, review 
FROM tutorial.dunder_mifflin_paper_sales 
WHERE review IS NOT NULL;

ORDER BY

A basic way to list is using ORDER BY when you need it listed by a specific column. Imagine a situation in which we wanted to list our biggest customers and we would know that by a column called business_size. So I selected account_name and business_size and order them by business_size. DESC will make it show the biggest customer first.

SELECT account_name, business_size 
FROM tutorial.dunder_mifflin_paper_sales 
ORDER BY business_size DESC LIMIT 5;

COUNT/ GROUP BY

At that project at work, I had to do many kinds of counting to ensure the tables were correct. A good way to get started is just a count of all rows.

SELECT COUNT(*) FROM tutorial.dunder_mifflin_paper_sales;

Imagine that, at Dunder Mifflin, every sale is recorded with a status, which could be CANCELLED, RETURNED or COMPLETED. Let’s imagine someone needed to know how many of each status we had from all those sales. The query we need is

SELECT status, COUNT(*) 
FROM tutorial.dunder_mifflin_paper_sales 
GROUP BY status;

We use GROUP BY because there are many sales for each status. So first we group them, then we count how many of each exist.

As we can see the big majority of the sales were COMPLETED🎉 Good job, Dunder Mifflin!

Another example where we could use a similar query would be to understand people’s rates about Dunder Mifflin. There is a column called rating where customers were able to rate our favorite paper company. This is the query we need

SELECT rating, COUNT(*) 
FROM tutorial.dunder_mifflin_paper_sales 
GROUP BY rating 
ORDER BY rating DESC;

We group by rating so we can do the counting of each rating, then we order by rating in a descending way, so the highest rating shows up first. 5 stars don’t really sound like Dunder Mifflin, so a tie between 5, 4 and 3 stars makes sense! 😂

I also needed to run the same kind of query but to find out how many nulls I had. In Dunder Mifflin's example, imagine we wanted to count how many null reviews we had per state. We could run this

SELECT shipping_state, COUNT(*) 
FROM tutorial.dunder_mifflin_paper_sales 
WHERE review IS NULL 
GROUP BY shipping_state 
ORDER BY COUNT(*) DESC;

We group by and count shipping_state, adding the condition “where review is null”, so that would give us the desired result. Apparently Alabama doesn’t really like expressing their opinion about Dunder Mifflin😂

In the same spirit, we could find out which state loves to review Dunder Mifflin! The query would be exactly the same, except this time we use IS NOT NULL:

SELECT shipping_state, COUNT(*) 
FROM tutorial.dunder_mifflin_paper_sales 
WHERE review IS NOT NULL 
GROUP BY shipping_state 
ORDER BY COUNT(*) DESC;

Aaaand Montana is the winner 🤔😂

THE BEST SALESPERSON

And finally, the SQL query that you’ve always wanted to run but didn’t know… That’s right, we’re going to find out WHO’S THE BEST SALESPERSON AT DUNDER MIFFLIN!

Drumrolls.. 🥁

Run this query:

SELECT account_manager, COUNT(*) 
FROM tutorial.dunder_mifflin_paper_sales 
WHERE status = ‘COMPLETED’ 
GROUP BY account_manager 
ORDER BY COUNT(*) DESC;

What I did was to count all statuses that were ‘COMPLETED’, which means the sale was made, group by account_manager, and order by status count.

Aaaand FACT! Dwight Schrute is actually the best salesperson in the company, those numbers can’t lie😂

These are some of the queries I used at my project and I hope they can help you to better understand the SQL world! If you’d like to find out any other thing about this Dunder Mifflin dataset and it’s having a hard time writing the right SQL query, feel free to reach out!