
Why SQL? Notes from (big) data analysis practice

Posted By Jakub Nowacki, 07 September 2017

Structured Query Language (SQL) has been around since the 1980s and it is still going strong. For many years it has been the lingua franca of many areas of data processing, such as databases, data warehouses and business intelligence. Since it is a natural language for data handling, it is now making its way into more modern systems that process big data sets.

One of the first projects to leverage SQL for processing big data was Apache Hive. Its engine, originally built on top of Hadoop’s MapReduce implementation, made the big data ecosystem accessible to a much wider audience than just (Java) programming experts. While Hive is still going strong as a form of data warehouse, many other popular projects have adopted SQL for data processing. Apache Spark, a current leader in big data processing, made its SQL extension the recommended tool for its users. The trend soon expanded, and new tools like Presto and Apache Drill followed, as did ready-to-use cloud solutions like Google BigQuery, Amazon Athena and Azure Data Lake Analytics. And this doesn’t even include the classical database and data warehouse solutions, which are also becoming more and more capable, with SQL as their primary language.

Thus, if you want to work in the data analytics realm, you should go beyond that SELECT * FROM table query and learn a bit more about what SQL has to offer. Note that, at the end of the day, you may not need to use pure SQL, as objects like Spark DataFrames closely mimic it; still, it’s good to know the way of thinking that SQL puts in place.

In this post I’ll show why it’s important to learn and understand SQL, and how it will influence your future work with data.

Get some data

Let’s start by getting some data to analyze. In this example we will use the Watson Analytics Sample Data - Sales Products set. The sample file will be downloaded and stored in a local data folder.

import os
import urllib.request

if not os.path.exists('data'):
    os.mkdir('data')

url = 'https://community.watsonanalytics.com/wp-content/uploads/2015/08/WA_Sales_Products_2012-14.csv'
csv_file = 'data/WA_Sales_Products_2012-14.csv'

urllib.request.urlretrieve(url, csv_file);

For brevity’s sake, let’s look at the data first using Spark, as the SQL queries will later be demonstrated using its capabilities; note that I use Spark 2+, thus the CSV support is built in.

import pyspark
import pyspark.sql.functions as func

spark = pyspark.sql.SparkSession.builder\
    .appName('why_sql')\
    .getOrCreate()
sales = spark.read.csv(csv_file, header=True, inferSchema=True)

# This is needed for string queries; you don't need it when using the DataFrame syntax
sales.createTempView('sales')

sales.printSchema()
sales.show()
root
 |-- Retailer country: string (nullable = true)
 |-- Order method type: string (nullable = true)
 |-- Retailer type: string (nullable = true)
 |-- Product line: string (nullable = true)
 |-- Product type: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Revenue: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Gross margin: double (nullable = true)

+----------------+-----------------+-------------+--------------------+--------------------+--------------------+----+-------+---------+--------+------------+
|Retailer country|Order method type|Retailer type|        Product line|        Product type|             Product|Year|Quarter|  Revenue|Quantity|Gross margin|
+----------------+-----------------+-------------+--------------------+--------------------+--------------------+----+-------+---------+--------+------------+
|   United States|              Fax|Outdoors Shop|   Camping Equipment|        Cooking Gear|TrailChef Deluxe ...|2012|Q1 2012| 59628.66|     489|  0.34754797|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|        Cooking Gear|TrailChef Double ...|2012|Q1 2012| 35950.32|     252|   0.4742745|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|               Tents|           Star Dome|2012|Q1 2012| 89940.48|     147|  0.35277197|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|               Tents|        Star Gazer 2|2012|Q1 2012|165883.41|     303|  0.28293788|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|       Sleeping Bags|     Hibernator Lite|2012|Q1 2012| 119822.2|    1415|  0.29145017|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|       Sleeping Bags|  Hibernator Extreme|2012|Q1 2012| 87728.96|     352|  0.39814629|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|       Sleeping Bags| Hibernator Camp Cot|2012|Q1 2012| 41837.46|     426|  0.33560737|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|            Lanterns|        Firefly Lite|2012|Q1 2012|  8268.41|     577|  0.52896022|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|            Lanterns|     Firefly Extreme|2012|Q1 2012|   9393.3|     189|  0.43420523|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|            Lanterns|     EverGlow Single|2012|Q1 2012|  19396.5|     579|  0.46149254|
|   United States|              Fax|Outdoors Shop|   Camping Equipment|            Lanterns|     EverGlow Butane|2012|Q1 2012|  6940.03|     109|  0.36186587|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|                Rope|       Husky Rope 50|2012|Q1 2012|  20003.2|     133|  0.32905585|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|                Rope|       Husky Rope 60|2012|Q1 2012|  14109.4|      79|  0.29165733|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|                Rope|      Husky Rope 100|2012|Q1 2012| 73970.22|     227|  0.30126435|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|                Rope|      Husky Rope 200|2012|Q1 2012| 77288.64|     143|  0.31477575|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|              Safety|Granite Climbing ...|2012|Q1 2012| 62464.88|     898|  0.24468085|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|              Safety|       Husky Harness|2012|Q1 2012|  34154.9|     559|  0.28363339|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|              Safety|Husky Harness Ext...|2012|Q1 2012|  36396.8|     352|  0.47843327|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|              Safety|Granite Signal Mi...|2012|Q1 2012|  4074.84|     126|  0.51422387|
|   United States|              Fax|Outdoors Shop|Mountaineering Eq...|Climbing Accessories|   Granite Carabiner|2012|Q1 2012| 15122.72|    4022|   0.4787234|
+----------------+-----------------+-------------+--------------------+--------------------+--------------------+----+-------+---------+--------+------------+
only showing top 20 rows

Now we can look at some data. Consider the simple query below:

SELECT
    `Product line`,
    Product,
    Revenue
FROM sales
WHERE Year = 2012
LIMIT 10

Reading the above in words: we’d like to get the first 10 rows of the data for year 2012, but only the columns Product line, Product and Revenue. How would we do it in pure Python? See a simple sample below:

import csv

with open(csv_file, newline='') as f:
    reader = csv.reader(f)

    # Get the header line and make it into a name-to-index dict
    header = {name: i for i, name in enumerate(next(reader))}
    print('Header: {}'.format(header))

    # List with results
    res = list()

    # Columns to select
    cols = ['Product line', 'Product', 'Revenue']

    i = 0
    for line in reader:
        # limit 10
        if i >= 10:
            break

        # select ... where year = 2012
        if line[header['Year']] == '2012':
            res.append({col: line[header[col]] for col in cols})
            i += 1

res
Header: {'Retailer country': 0, 'Order method type': 1, 'Retailer type': 2, 'Product line': 3, 'Product type': 4, 'Product': 5, 'Year': 6, 'Quarter': 7, 'Revenue': 8, 'Quantity': 9, 'Gross margin': 10}

[{'Product': 'TrailChef Deluxe Cook Set',
  'Product line': 'Camping Equipment',
  'Revenue': '59628.66'},
 {'Product': 'TrailChef Double Flame',
  'Product line': 'Camping Equipment',
  'Revenue': '35950.32'},
 {'Product': 'Star Dome',
  'Product line': 'Camping Equipment',
  'Revenue': '89940.48'},
 {'Product': 'Star Gazer 2',
  'Product line': 'Camping Equipment',
  'Revenue': '165883.41'},
 {'Product': 'Hibernator Lite',
  'Product line': 'Camping Equipment',
  'Revenue': '119822.2'},
 {'Product': 'Hibernator Extreme',
  'Product line': 'Camping Equipment',
  'Revenue': '87728.96'},
 {'Product': 'Hibernator Camp Cot',
  'Product line': 'Camping Equipment',
  'Revenue': '41837.46'},
 {'Product': 'Firefly Lite',
  'Product line': 'Camping Equipment',
  'Revenue': '8268.41'},
 {'Product': 'Firefly Extreme',
  'Product line': 'Camping Equipment',
  'Revenue': '9393.3'},
 {'Product': 'EverGlow Single',
  'Product line': 'Camping Equipment',
  'Revenue': '19396.5'}]

Maybe it’s not very complicated, but we used some Python magic, including the standard library’s CSV reader, and got the result quite easily.

So how can we do it in Spark (or another SQL-capable system)? Well, we use the query above; note that we have already read the file into a DataFrame:

spark.sql("""
SELECT
    `Product line`,
    Product,
    Revenue
FROM sales
WHERE Year = 2012
LIMIT 10
""").show()
+-----------------+--------------------+---------+
|     Product line|             Product|  Revenue|
+-----------------+--------------------+---------+
|Camping Equipment|TrailChef Deluxe ...| 59628.66|
|Camping Equipment|TrailChef Double ...| 35950.32|
|Camping Equipment|           Star Dome| 89940.48|
|Camping Equipment|        Star Gazer 2|165883.41|
|Camping Equipment|     Hibernator Lite| 119822.2|
|Camping Equipment|  Hibernator Extreme| 87728.96|
|Camping Equipment| Hibernator Camp Cot| 41837.46|
|Camping Equipment|        Firefly Lite|  8268.41|
|Camping Equipment|     Firefly Extreme|   9393.3|
|Camping Equipment|     EverGlow Single|  19396.5|
+-----------------+--------------------+---------+

Well, that is a bit easier, don’t you think? Note that for readability I use show, which prints out a text table, but I can also get the data via collect, or toPandas if I want a Pandas DataFrame; e.g.:

spark.sql("""
SELECT
    `Product line`,
    Product,
    Revenue
FROM sales
WHERE Year = 2012
LIMIT 10
""").collect()
[Row(Product line='Camping Equipment', Product='TrailChef Deluxe Cook Set', Revenue=59628.66),
 Row(Product line='Camping Equipment', Product='TrailChef Double Flame', Revenue=35950.32),
 Row(Product line='Camping Equipment', Product='Star Dome', Revenue=89940.48),
 Row(Product line='Camping Equipment', Product='Star Gazer 2', Revenue=165883.41),
 Row(Product line='Camping Equipment', Product='Hibernator Lite', Revenue=119822.2),
 Row(Product line='Camping Equipment', Product='Hibernator Extreme', Revenue=87728.96),
 Row(Product line='Camping Equipment', Product='Hibernator Camp Cot', Revenue=41837.46),
 Row(Product line='Camping Equipment', Product='Firefly Lite', Revenue=8268.41),
 Row(Product line='Camping Equipment', Product='Firefly Extreme', Revenue=9393.3),
 Row(Product line='Camping Equipment', Product='EverGlow Single', Revenue=19396.5)]
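
As mentioned earlier, the DataFrame API closely mirrors SQL, so you don’t strictly need string queries at all. Below is a sketch of the same query expressed with DataFrame methods on the sales DataFrame we already read, using the functions module imported as func above:

# The same query with the DataFrame API instead of an SQL string
sales.filter(func.col('Year') == 2012)\
    .select('Product line', 'Product', 'Revenue')\
    .limit(10)\
    .show()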

Now let’s add a little more complexity and append the average revenue to every row as follows:

SELECT
    `Product line`,
    Product,
    Revenue,
    (SELECT AVG(Revenue) FROM sales) AS `Avg Revenue`
FROM sales
WHERE Year = 2012
LIMIT 10

In Python, or any other imperative programming language, we’d need to go over all the records first to calculate the average, and then append it to the result, e.g.:

with open(csv_file, newline='') as f:
    reader = csv.reader(f)

    # Get the header line and make it into a name-to-index dict
    header = {name: i for i, name in enumerate(next(reader))}
    print('Header: {}'.format(header))

    # List with results
    res = list()

    # Columns to select
    cols = ['Product line', 'Product', 'Revenue']

    i = 0
    cum_sum = 0 # cumulative sum for average
    n = 0 # number of rows
    for line in reader:
        cum_sum += float(line[header['Revenue']])
        n += 1

        # select ... where year = 2012 limit 10
        if line[header['Year']] == '2012' and i < 10:
            res.append({col: line[header[col]] for col in cols})
            i += 1

# Calculate average
rev_avg = cum_sum/n

# Update the result

def append_rev(d, rev):
    d['Average Revenue'] = rev
    return d

res = list(map(lambda d: append_rev(d, rev_avg), res))

res
Header: {'Retailer country': 0, 'Order method type': 1, 'Retailer type': 2, 'Product line': 3, 'Product type': 4, 'Product': 5, 'Year': 6, 'Quarter': 7, 'Revenue': 8, 'Quantity': 9, 'Gross margin': 10}





[{'Average Revenue': 42638.292909070755,
  'Product': 'TrailChef Deluxe Cook Set',
  'Product line': 'Camping Equipment',
  'Revenue': '59628.66'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'TrailChef Double Flame',
  'Product line': 'Camping Equipment',
  'Revenue': '35950.32'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Star Dome',
  'Product line': 'Camping Equipment',
  'Revenue': '89940.48'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Star Gazer 2',
  'Product line': 'Camping Equipment',
  'Revenue': '165883.41'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Hibernator Lite',
  'Product line': 'Camping Equipment',
  'Revenue': '119822.2'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Hibernator Extreme',
  'Product line': 'Camping Equipment',
  'Revenue': '87728.96'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Hibernator Camp Cot',
  'Product line': 'Camping Equipment',
  'Revenue': '41837.46'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Firefly Lite',
  'Product line': 'Camping Equipment',
  'Revenue': '8268.41'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'Firefly Extreme',
  'Product line': 'Camping Equipment',
  'Revenue': '9393.3'},
 {'Average Revenue': 42638.292909070755,
  'Product': 'EverGlow Single',
  'Product line': 'Camping Equipment',
  'Revenue': '19396.5'}]

Using Spark SQL we can just add the subquery to generate a new column as follows:

spark.sql("""
SELECT
    `Product line`,
    Product,
    Revenue,
    (SELECT AVG(Revenue) FROM sales) AS `Avg Revenue`
FROM sales
WHERE Year = 2012
LIMIT 10
""").show()
+-----------------+--------------------+---------+------------------+
|     Product line|             Product|  Revenue|       Avg Revenue|
+-----------------+--------------------+---------+------------------+
|Camping Equipment|TrailChef Deluxe ...| 59628.66|42638.292909070165|
|Camping Equipment|TrailChef Double ...| 35950.32|42638.292909070165|
|Camping Equipment|           Star Dome| 89940.48|42638.292909070165|
|Camping Equipment|        Star Gazer 2|165883.41|42638.292909070165|
|Camping Equipment|     Hibernator Lite| 119822.2|42638.292909070165|
|Camping Equipment|  Hibernator Extreme| 87728.96|42638.292909070165|
|Camping Equipment| Hibernator Camp Cot| 41837.46|42638.292909070165|
|Camping Equipment|        Firefly Lite|  8268.41|42638.292909070165|
|Camping Equipment|     Firefly Extreme|   9393.3|42638.292909070165|
|Camping Equipment|     EverGlow Single|  19396.5|42638.292909070165|
+-----------------+--------------------+---------+------------------+

Now let’s do the same, but order the result by revenue in descending (largest to smallest) order. In Python we’d need to read all the data first and sort it by revenue, so let’s not even go there. In SQL we just add one expression, ORDER BY Revenue DESC, e.g.:

spark.sql("""
SELECT
    `Product line`,
    Product,
    Revenue,
    (SELECT AVG(Revenue) FROM sales) AS `Avg Revenue`
FROM sales
WHERE Year = 2012
ORDER BY Revenue DESC
LIMIT 10
""").show()
+--------------------+------------+----------+------------------+
|        Product line|     Product|   Revenue|       Avg Revenue|
+--------------------+------------+----------+------------------+
|   Camping Equipment|   Star Lite|1210413.68|42638.292909070165|
|Personal Accessories|        Zone| 1042285.0|42638.292909070165|
|Personal Accessories|        Zone| 1009957.9|42638.292909070165|
|   Camping Equipment|Star Gazer 2| 944385.75|42638.292909070165|
|   Camping Equipment|Star Gazer 3| 799330.96|42638.292909070165|
|Personal Accessories|        Zone|  745868.9|42638.292909070165|
|   Camping Equipment|   Star Lite| 726804.88|42638.292909070165|
|Personal Accessories|        Zone|  711955.2|42638.292909070165|
|   Camping Equipment|   Star Lite| 706259.02|42638.292909070165|
|   Camping Equipment|Star Gazer 2| 683242.56|42638.292909070165|
+--------------------+------------+----------+------------------+

This simplicity of data querying in SQL is possible because the language is declarative: we state what we want, not how we want it done, as we would in imperative languages. The complexity of query execution is offloaded to the engine, which does the actual work on the data but (usually) has internal optimizations that make it as fast as possible for the given system, at least in theory. Also, if the table were in a database, e.g. PostgreSQL, we would issue exactly the same query and get the same result, perhaps apart from how the query result is returned, but it would be very similar.
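
You can actually inspect what the engine plans to do with a query; for instance, Spark exposes the plan produced by its Catalyst optimizer via explain (the exact output differs between Spark versions):

# Print the execution plan Spark has chosen for the query
spark.sql("""
SELECT `Product line`, Product, Revenue
FROM sales
WHERE Year = 2012
LIMIT 10
""").explain()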

Thus, we can now write fancier queries, like the grouping below:

spark.sql("""
SELECT
    `Product line`,
    AVG(Revenue) AS `Avg Revenue`,
    SUM(Quantity) AS `Sum Quantity`,
    MAX(Quantity) AS `Max Quantity`
FROM sales
GROUP BY `Product line`
ORDER BY `Avg Revenue` DESC
""").show()
+--------------------+------------------+------------+------------+
|        Product line|       Avg Revenue|Sum Quantity|Max Quantity|
+--------------------+------------------+------------+------------+
|      Golf Equipment| 73783.81206980941|     4020719|       17904|
|Mountaineering Eq...|51574.988404884796|     9900091|       67875|
|   Camping Equipment| 50512.76144011912|    21406096|       35122|
|Personal Accessories|38033.354059620106|    27335366|       42877|
|  Outdoor Protection| 4620.507561484914|     6400089|       24379|
+--------------------+------------------+------------+------------+

This way we can get some characteristics of a data set very quickly, without going into the details of the implementation.
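
For comparison, the same grouping can be written with the DataFrame API; a sketch equivalent to the SQL above, with aliases keeping the same column names, would be:

# Equivalent grouping using groupBy/agg instead of an SQL string
sales.groupBy('Product line')\
    .agg(
        func.avg('Revenue').alias('Avg Revenue'),
        func.sum('Quantity').alias('Sum Quantity'),
        func.max('Quantity').alias('Max Quantity'))\
    .orderBy(func.desc('Avg Revenue'))\
    .show()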

But that’s not all! We can do quite elaborate calculations in SQL, like using window functions, e.g.:

spark.sql("""
SELECT  *
FROM (
 SELECT
   `Product line`,
   Product,
   Year,
   `Revenue`,
   RANK() OVER(
     PARTITION BY `Product line` ORDER BY Revenue DESC
     ) AS Rank,
   (Revenue - (SELECT AVG(Revenue) FROM sales))
     AS `Diff Revenue`
  FROM sales
  )
WHERE Rank <= 3
ORDER BY `Product line`
""").show()
+--------------------+--------------------+----+----------+----+------------------+
|        Product line|             Product|Year|   Revenue|Rank|      Diff Revenue|
+--------------------+--------------------+----+----------+----+------------------+
|   Camping Equipment|        Star Gazer 2|2014| 1486717.1|   1|  1444078.80709093|
|   Camping Equipment|           Star Lite|2013|1415141.91|   2|1372503.6170909298|
|   Camping Equipment|        Star Gazer 2|2013| 1335112.9|   3|1292474.6070909298|
|      Golf Equipment|Hailstorm Titaniu...|2014|1635687.96|   1|1593049.6670909298|
|      Golf Equipment|Hailstorm Titaniu...|2014| 1388659.5|   2|1346021.2070909298|
|      Golf Equipment|Hailstorm Titaniu...|2013|1226669.53|   3|1184031.2370909299|
|Mountaineering Eq...|     Granite Extreme|2013|  725496.0|   1| 682857.7070909298|
|Mountaineering Eq...|     Granite Extreme|2013|  710828.0|   3| 668189.7070909298|
|Mountaineering Eq...|      Husky Rope 100|2014| 712340.79|   2| 669702.4970909299|
|  Outdoor Protection|   BugShield Extreme|2012| 119042.42|   2| 76404.12709092983|
|  Outdoor Protection|   BugShield Extreme|2012| 107273.74|   3| 64635.44709092984|
|  Outdoor Protection|   BugShield Extreme|2012| 160956.39|   1|118318.09709092985|
|Personal Accessories|                Zone|2013|1230450.95|   1|1187812.6570909298|
|Personal Accessories|                Zone|2013| 1118965.3|   2|  1076327.00709093|
|Personal Accessories|                Zone|2012| 1042285.0|   3| 999646.7070909298|
+--------------------+--------------------+----+----------+----+------------------+

Here, we’ve calculated the top 3 products by revenue in every product line, and also how much more revenue than the average we got on these products. Note that the same product may have had a different revenue in each year.
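
The ranking part can also be expressed with the DataFrame API using a window specification; below is a sketch of the top-3-per-product-line query (the difference from the average revenue is left out for brevity):

from pyspark.sql.window import Window

# Rank products by revenue within each product line and keep the top 3
w = Window.partitionBy('Product line').orderBy(func.desc('Revenue'))

sales.select('Product line', 'Product', 'Year', 'Revenue')\
    .withColumn('Rank', func.rank().over(w))\
    .filter(func.col('Rank') <= 3)\
    .orderBy('Product line')\
    .show()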

We can also compute grouping aggregates like ROLLUP, i.e. evaluate the aggregate functions at every level of the grouping columns, giving subtotals, e.g.:

spark.sql("""
SELECT
    `Product line`,
    Year,
    COUNT(*) AS Count,
    AVG(Revenue) AS `Avg Revenue`,
    SUM(Revenue) AS `Sum Revenue`
FROM sales
GROUP BY `Product line`, Year WITH ROLLUP
ORDER BY `Product line`, Year
""").show()
+--------------------+----+-----+------------------+--------------------+
|        Product line|Year|Count|       Avg Revenue|         Sum Revenue|
+--------------------+----+-----+------------------+--------------------+
|                null|null|88475|42638.292909070165| 3.772422965129983E9|
|   Camping Equipment|null|24866| 50512.76144011912| 1.256050325970002E9|
|   Camping Equipment|2012| 9807| 41068.37699296411|4.0275757316999906E8|
|   Camping Equipment|2013| 9250|54095.397062702745| 5.003824228300004E8|
|   Camping Equipment|2014| 5809| 60752.33774660014| 3.529103299700002E8|
|      Golf Equipment|null| 7764| 73783.81206980941| 5.728575169100003E8|
|      Golf Equipment|2012| 3051| 55066.02001638817| 1.680064270700003E8|
|      Golf Equipment|2013| 2847| 80825.52530734114|2.3011027055000025E8|
|      Golf Equipment|2014| 1866| 93644.59769024656|1.7474081929000008E8|
|Mountaineering Eq...|null| 7943|51574.988404884796| 4.096601328999999E8|
|Mountaineering Eq...|2012| 3255| 32903.12133333326|1.0709965993999977E8|
|Mountaineering Eq...|2013| 2766| 58221.19423716561|1.6103982326000008E8|
|Mountaineering Eq...|2014| 1922| 73631.97174817898|1.4152064970000002E8|
|  Outdoor Protection|null| 8620| 4620.507561484914|3.9828775179999955E7|
|  Outdoor Protection|2012| 3205| 7802.987232449289|2.5008574079999972E7|
|  Outdoor Protection|2013| 3374|3067.3313100177807|1.0349175839999992E7|
|  Outdoor Protection|2014| 2041| 2190.605222929936|          4471025.26|
|Personal Accessories|null|39282|38033.354059620106| 1.494026214169997E9|
|Personal Accessories|2012|14810| 30811.84037137066| 4.563233558999995E8|
|Personal Accessories|2013|14786|40173.773056945756|      5.9400940842E8|
+--------------------+----+-----+------------------+--------------------+
only showing top 20 rows

Note that, compared to a simple grouping, we get records where a grouping column is null; these are the results of the functions, sum and average in our case, evaluated over all years for a given product line. The roll-up also gives the totals over all years and product lines.
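
The DataFrame API exposes the same operation as rollup; a sketch equivalent to the query above could look like this:

# Roll-up grouping by product line and year; rows with nulls are the subtotals
sales.rollup('Product line', 'Year')\
    .agg(
        func.count(func.lit(1)).alias('Count'),
        func.avg('Revenue').alias('Avg Revenue'),
        func.sum('Revenue').alias('Sum Revenue'))\
    .orderBy('Product line', 'Year')\
    .show()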

There are many more interesting window functions in SQL; see for instance this blog post about the support for SQL functions in Spark. And we haven’t even got to useful features like joins or partitioning yet.

I hope that now you can see that SQL is worth looking into, especially if you’re interested in data analytics. Apart from the language itself, it gives you a new way of thinking about how data queries are done, and it can bring a vast improvement in the time performance of your queries, as a lot of optimization is done under the hood for you.

The original source notebook is available here.
