Recently I worked on a project where I had to step up my game in SQL. I decided to publish some queries with the context of why I would need them in the hope of helping someone who’s entering this wild SQL world. I feel the context is essential here, some of these queries I’ve seen before in SQL tutorials, but I couldn’t really understand in what context I could use them.
I was looking for a place where I could find a database with a table I could easily use those queries and I found the best table in the world, dunder_mifflin_paper_sales😂. If you don’t know what I’m talking about, do yourself a favor and Wikipedia it!
This dataset is available at Mode, you can sign up for free and find it at Mode Community (where all free datasets live). Just add the queries below to see the same results, or even create other queries with the same dataset!
TABLE SIZE
One query I had never seen before was to get a table size. This is only important for very heavy tables so you can maybe get an estimate of how long it would take to upload it. Pg_total_relation_size
includes the total size of a table, including indices and additional objects.
SELECT pg_size_pretty(
pg_total_relation_size('tutorial.dunder_mifflin_paper_sales'));
SELECT EVERYTHING
One good way of taking a first peek at what a table has is listing every single column and table from it. In cases of very big tables, adding a limit is recommended, since it may take a while to show everything. For dunder_mifflin_paper_sales case, I used a limit of 5.
SELECT * FROM tutorial.dunder_mifflin_paper_sales LIMIT 5;
SELECT WHEN NULL
One check I did was to verify if there were null rows. For dunder_mifflin_paper_sales example, customers added nice reviews about our favorite paper company. Let’s say someone wanted to list all reviews, but some companies didn’t write any review, so if I had a query like
SELECT account_name, review
FROM tutorial.dunder_mifflin_paper_sales;
There would definitely be some empty fields. For that not to happen, and we list only the ones who wrote a review, we run the query
SELECT account_name, review
FROM tutorial.dunder_mifflin_paper_sales
WHERE review IS NOT NULL;
ORDER BY
A basic way to list is using ORDER BY when you need it listed by a specific column. Imagine a situation in which we wanted to list our biggest customers and we would know that by a column called business_size
. So I selected account_name
and business_size
and order them by business_size
. DESC will make it show the biggest customer first.
SELECT account_name, business_size
FROM tutorial.dunder_mifflin_paper_sales
ORDER BY business_size DESC LIMIT 5;
COUNT/ GROUP BY
At that project at work, I had to do many kinds of counting to ensure the tables were correct. A good way to get started is just a count of all rows.
SELECT COUNT(*) FROM tutorial.dunder_mifflin_paper_sales;
Imagine that, at Dunder Mifflin, every sale is recorded with a status, which could be CANCELLED, RETURNED or COMPLETED. Let’s imagine someone needed to know how many of each status we had from all those sales. The query we need is
SELECT status, COUNT(*)
FROM tutorial.dunder_mifflin_paper_sales
GROUP BY status;
We use GROUP BY because there are many sales for each status. So first we group them, then we count how many of each exist.
As we can see the big majority of the sales were COMPLETED🎉 Good job, Dunder Mifflin!
Another example where we could use a similar query would be to understand people’s rates about Dunder Mifflin. There is a column called rating
where customers were able to rate our favorite paper company. This is the query we need
SELECT rating, COUNT(*)
FROM tutorial.dunder_mifflin_paper_sales
GROUP BY rating
ORDER BY rating DESC;
We group by rating so we can do the counting of each rating, then we order by rating in a descending way, so the highest rating shows up first. 5 stars don’t really sound like Dunder Mifflin, so a tie between 5, 4 and 3 stars makes sense! 😂
I also needed to run the same kind of query but to find out how many nulls I had. In Dunder Mifflin's example, imagine we wanted to count how many null reviews we had per state. We could run this
SELECT shipping_state, COUNT(*)
FROM tutorial.dunder_mifflin_paper_sales
WHERE review IS NULL
GROUP BY shipping_state
ORDER BY COUNT(*) DESC;
We group by and count shipping_state
, adding the condition “where review is null”, so that would give us the desired result. Apparently Alabama doesn’t really like expressing their opinion about Dunder Mifflin😂
In the same spirit, we could find out which state loves to review Dunder Mifflin! The query would be exactly the same, except this time we use IS NOT NULL:
SELECT shipping_state, COUNT(*)
FROM tutorial.dunder_mifflin_paper_sales
WHERE review IS NOT NULL
GROUP BY shipping_state
ORDER BY COUNT(*) DESC;
Aaaand Montana is the winner 🤔😂
THE BEST SALESPERSON
And finally, the SQL query that you’ve always wanted to run but didn’t know… That’s right, we’re going to find out WHO’S THE BEST SALESPERSON AT DUNDER MIFFLIN!
Drumrolls.. 🥁
Run this query:
SELECT account_manager, COUNT(*)
FROM tutorial.dunder_mifflin_paper_sales
WHERE status = ‘COMPLETED’
GROUP BY account_manager
ORDER BY COUNT(*) DESC;
What I did was to count all statuses that were ‘COMPLETED’, which means the sale was made, group by account_manager
, and order by status
count.
Aaaand FACT! Dwight Schrute is actually the best salesperson in the company, those numbers can’t lie😂
These are some of the queries I used at my project and I hope they can help you to better understand the SQL world! If you’d like to find out any other thing about this Dunder Mifflin dataset and it’s having a hard time writing the right SQL query, feel free to reach out!