
Word count is Spark SQL with a pinch of TF-IDF (continued)

Posted By Jakub Nowacki, 31 August 2017

In this post we continue with the example introduced last week to calculate TF-IDF measures and find the most characteristic words for each of the analysed books.

TF-IDF

The term TF-IDF stands for Term Frequency–Inverse Document Frequency, and it is a very popular data mining and analytics measure for finding words characteristic of a given document. It consists of two parts, and the final measure is the product of the TF and IDF values. Note that here I describe this statistic very briefly; you can find more in-depth descriptions online, like the Wikipedia article or this tutorial.

The first measure, Term Frequency (TF), is very similar to what we already did for the word count, but TF is calculated per document and normalized, i.e. divided by the count of all words in the book. We can put it in an equation as follows:

$$\mathrm{tf} = \frac{n_w}{n_d}$$

where $n_w$ is the number of occurrences of a particular word in a document and $n_d$ is the overall number of words in the document.

We can use the above equation on our Spark DF as follows:

import pyspark.sql.functions as func
from pyspark.sql.window import Window

# book_words is the (book, word) DataFrame built in last week's post.
# The window spans all rows of a given book, so aggregates over it
# give per-document totals.
w = Window.partitionBy(book_words['book'])

book_tf = book_words.groupBy('book', 'word')\
    .agg(func.count('*').alias('n_w'),
         func.sum(func.count('*')).over(w).alias('n_d'),
         (func.count('*')/func.sum(func.count('*')).over(w)).alias('tf')
        )\
    .orderBy('n_w', ascending=False)\
    .cache()

book_tf.show(truncate=15)
+---------------+----+-----+------+---------------+
|           book|word|  n_w|   n_d|             tf|
+---------------+----+-----+------+---------------+
|  war_and_peace| the|34725|586914|0.0591653973...|
|  war_and_peace| and|22307|586914|0.0380072719...|
|  war_and_peace|  to|16755|586914|0.0285476236...|
|  war_and_peace|  of|15008|586914|0.0255710376...|
|      moby_dick| the|14718|222629|0.0661099856...|
|  war_and_peace|   a|10584|586914|0.0180333064...|
|  war_and_peace|  he|10007|586914|0.0170501981...|
|  war_and_peace|  in| 9036|586914|0.0153957820...|
|  war_and_peace|that| 8206|586914|0.0139816054...|
|        dracula| the| 8093|166949|0.0484758818...|
|  war_and_peace| his| 7984|586914|0.0136033558...|
|  war_and_peace| was| 7360|586914|0.0125401677...|
|grimms_fairy...| the| 7224|105336|0.0685805422...|
|      moby_dick|  of| 6743|222629|0.0302880577...|
|      moby_dick| and| 6518|222629|0.0292774077...|
|        dracula| and| 5976|166949|0.0357953626...|
|  war_and_peace|with| 5710|586914|0.0097288529...|
|  war_and_peace|  it| 5617|586914|0.0095703970...|
|grimms_fairy...| and| 5551|105336|0.0526980329...|
|  war_and_peace| had| 5365|586914|0.0091410325...|
+---------------+----+-----+------+---------------+
only showing top 20 rows

Since the measure is calculated over a document, I use a window function to perform the overall count of words in one step. See the documentation or this blog post for more details.
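For comparison, the same TF values could be computed without window functions by aggregating the per-book totals separately and joining them back; the window version simply avoids the extra join. A sketch under that approach (book_totals and book_tf_alt are illustrative names):

# Window-free alternative: compute per-book word totals, then join them back.
book_totals = book_words.groupBy('book').agg(func.count('*').alias('n_d'))

book_tf_alt = book_words.groupBy('book', 'word')\
    .agg(func.count('*').alias('n_w'))\
    .join(book_totals, 'book')\
    .withColumn('tf', func.col('n_w') / func.col('n_d'))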

As we have seen before, stop words tend to get the highest TF values, as they appear most often in a document. This is clearly not the information we are after, as it does not say much about how characteristic these words really are. Help comes from the second measure, Inverse Document Frequency (IDF), which is defined as:

$$\mathrm{idf} = \log\frac{c_d}{i_d}$$

where $c_d$ is the total number of documents and $i_d$ is the number of documents containing the term; note that $\log$ is the natural logarithm (base $e$).

Now we apply the above equation to our Spark DF as follows:

# The window spans all (word, book) rows for a given word, so counting rows
# over it gives i_d, the number of books containing the word.
w = Window.partitionBy('word')

# c_d: the total number of documents (books) in the data set.
c_d = book_words.select('book').distinct().count()

book_idf = book_words.groupBy('word', 'book').agg(
        func.lit(c_d).alias('c_d'),
        func.count('*').over(w).alias('i_d'),
        func.log(func.lit(c_d)/func.count('*').over(w)).alias('idf')
    )\
    .orderBy('idf', ascending=False)\
    .cache()

book_idf.show(truncate=15)
+-----------+---------------+---+---+---------------+
|       word|           book|c_d|i_d|            idf|
+-----------+---------------+---+---+---------------+
|         07|  war_and_peace|  6|  1|1.7917594692...|
|  archduchy|  war_and_peace|  6|  1|1.7917594692...|
|ascriptions|      moby_dick|  6|  1|1.7917594692...|
|     bazaar|  war_and_peace|  6|  1|1.7917594692...|
|   besmoked|      moby_dick|  6|  1|1.7917594692...|
|     bleeve|     tom_sawyer|  6|  1|1.7917594692...|
|   bottomed|      moby_dick|  6|  1|1.7917594692...|
|   bowsprit|      moby_dick|  6|  1|1.7917594692...|
|        ces|  war_and_peace|  6|  1|1.7917594692...|
|     ceylon|      moby_dick|  6|  1|1.7917594692...|
|       clog|        dracula|  6|  1|1.7917594692...|
|     coaxed|     tom_sawyer|  6|  1|1.7917594692...|
|   coverlid|      moby_dick|  6|  1|1.7917594692...|
|   digester|      moby_dick|  6|  1|1.7917594692...|
|   douleurs|  war_and_peace|  6|  1|1.7917594692...|
| draggingly|      moby_dick|  6|  1|1.7917594692...|
|       earl|      moby_dick|  6|  1|1.7917594692...|
| electrical|        dracula|  6|  1|1.7917594692...|
|       ells|grimms_fairy...|  6|  1|1.7917594692...|
|  elocution|  war_and_peace|  6|  1|1.7917594692...|
+-----------+---------------+---+---+---------------+
only showing top 20 rows

IDF gives high values to words that occur in fewer documents, as is clearly visible in the table; e.g. a word appearing in only one of the six books gets $\mathrm{idf} = \log(6/1) \approx 1.792$, which is exactly the value shown above. Again, a window function lets us obtain the per-word document count in a single step.

Now we can combine the two data frames with a join on the book and word columns. After the join we can add one additional column with the calculated TF-IDF as follows:

# Join TF and IDF per (book, word) pair and multiply the two measures.
book_tfidf = book_tf.join(book_idf, ['book', 'word'])\
    .withColumn('tf_idf', func.col('tf') * func.col('idf'))\
    .cache()

book_tfidf.orderBy('tf_idf', ascending=False).show(truncate=12)
+------------+--------+----+------+------------+---+---+------------+------------+
|        book|    word| n_w|   n_d|          tf|c_d|i_d|         idf|      tf_idf|
+------------+--------+----+------+------------+---+---+------------+------------+
|war_and_p...|  pierre|1963|586914|0.0033446...|  6|  1|1.7917594...|0.0059927...|
|  tom_sawyer|    huck| 258| 77612|0.0033242...|  6|  1|1.7917594...|0.0059562...|
|  tom_sawyer|     tom| 824| 77612|0.0106169...|  6|  4|0.4054651...|0.0043047...|
|   moby_dick|    ahab| 517|222629|0.0023222...|  6|  1|1.7917594...|0.0041609...|
|   moby_dick|   whale|1244|222629|0.0055877...|  6|  3|0.6931471...|0.0038731...|
|     dracula| helsing| 323|166949|0.0019347...|  6|  1|1.7917594...|0.0034665...|
|     dracula|    lucy| 301|166949|0.0018029...|  6|  1|1.7917594...|0.0032304...|
|war_and_p...|    rost| 965|586914|0.0016441...|  6|  1|1.7917594...|0.0029459...|
|  tom_sawyer|   becky| 115| 77612|0.0014817...|  6|  1|1.7917594...|0.0026549...|
|     dracula|    mina| 244|166949|0.0014615...|  6|  1|1.7917594...|0.0026186...|
|  tom_sawyer|     joe| 170| 77612|0.0021903...|  6|  2|1.0986122...|0.0024063...|
|war_and_p...|     sha|1273|586914|0.0021689...|  6|  2|1.0986122...|0.0023828...|
|war_and_p...|  prince|1928|586914|0.0032849...|  6|  3|0.6931471...|0.0022769...|
|war_and_p...|     nat|1215|586914|0.0020701...|  6|  2|1.0986122...|0.0022742...|
|war_and_p...|  moscow| 722|586914|0.0012301...|  6|  1|1.7917594...|0.0022041...|
|   moby_dick|  whales| 271|222629|0.0012172...|  6|  1|1.7917594...|0.0021810...|
|     dracula|     van| 323|166949|0.0019347...|  6|  2|1.0986122...|0.0021255...|
|   moby_dick|   stubb| 261|222629|0.0011723...|  6|  1|1.7917594...|0.0021005...|
|   moby_dick|queequeg| 253|222629|0.0011364...|  6|  1|1.7917594...|0.0020361...|
|     dracula|  harker| 175|166949|0.0010482...|  6|  1|1.7917594...|0.0018781...|
+------------+--------+----+------+------------+---+---+------------+------------+
only showing top 20 rows

Note that all the intermediate measures are still in the resulting DF; normally we would keep just the final result and drop the other columns, which we will not use later, as they will artificially inflate the data size, especially for larger data sets.
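For instance, a slimmed-down frame keeping only the final measure could look like this (book_tfidf_slim is just an illustrative name):

# Keep only the columns used downstream to limit the data size.
book_tfidf_slim = book_tfidf.select('book', 'word', 'tf_idf')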

The above table has all the required information, but it is simply ordered by the global TF-IDF value. Let's use window functions to sort the data frame and return only the top 5 words by TF-IDF for every book. It can be done as follows:

# Rank words within each book by descending TF-IDF.
w = Window.partitionBy('book').orderBy(func.col('tf_idf').desc())

book_tfidf.withColumn('rank', func.rank().over(w))\
    .where('rank <= 5')\
    .drop('rank')\
    .orderBy('book', 'tf_idf')\
    .select('book', 'word', 'tf', 'idf', 'tf_idf')\
    .show(truncate=12, n=30)
+------------+------------+------------+------------+------------+
|        book|        word|          tf|         idf|      tf_idf|
+------------+------------+------------+------------+------------+
|     dracula|      harker|0.0010482...|1.7917594...|0.0018781...|
|     dracula|         van|0.0019347...|1.0986122...|0.0021255...|
|     dracula|        mina|0.0014615...|1.7917594...|0.0026186...|
|     dracula|        lucy|0.0018029...|1.7917594...|0.0032304...|
|     dracula|     helsing|0.0019347...|1.7917594...|0.0034665...|
|frankenstein|frankenstein|4.0777317...|1.7917594...|7.3063144...|
|frankenstein|   elizabeth|0.0011723...|0.6931471...|8.1260962...|
|frankenstein|       felix|6.3714558...|1.7917594...|0.0011416...|
|frankenstein|     justine|7.0086014...|1.7917594...|0.0012557...|
|frankenstein|     clerval|7.5183179...|1.7917594...|0.0013471...|
|grimms_fa...|      hansel|4.4619123...|1.7917594...|7.9946737...|
|grimms_fa...|       dwarf|5.8859269...|1.7917594...|0.0010546...|
|grimms_fa...|      tailor|6.5504670...|1.7917594...|0.0011736...|
|grimms_fa...|        hans|0.0011866...|1.0986122...|0.0013036...|
|grimms_fa...|      gretel|9.3035619...|1.7917594...|0.0016669...|
|   moby_dick|    queequeg|0.0011364...|1.7917594...|0.0020361...|
|   moby_dick|       stubb|0.0011723...|1.7917594...|0.0021005...|
|   moby_dick|      whales|0.0012172...|1.7917594...|0.0021810...|
|   moby_dick|       whale|0.0055877...|0.6931471...|0.0038731...|
|   moby_dick|        ahab|0.0023222...|1.7917594...|0.0041609...|
|  tom_sawyer|       injun|9.6634541...|1.7917594...|0.0017314...|
|  tom_sawyer|         joe|0.0021903...|1.0986122...|0.0024063...|
|  tom_sawyer|       becky|0.0014817...|1.7917594...|0.0026549...|
|  tom_sawyer|         tom|0.0106169...|0.4054651...|0.0043047...|
|  tom_sawyer|        huck|0.0033242...|1.7917594...|0.0059562...|
|war_and_p...|         nat|0.0020701...|1.0986122...|0.0022742...|
|war_and_p...|      prince|0.0032849...|0.6931471...|0.0022769...|
|war_and_p...|         sha|0.0021689...|1.0986122...|0.0023828...|
|war_and_p...|        rost|0.0016441...|1.7917594...|0.0029459...|
|war_and_p...|      pierre|0.0033446...|1.7917594...|0.0059927...|
+------------+------------+------------+------------+------------+

Now it is much clearer that the top TF-IDF words per book, i.e. per document, are indeed characteristic of it.
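One detail worth noting: rank() assigns equal ranks to ties, so a book could in principle return more than 5 words; if exactly 5 rows per book are required, row_number() can be swapped in:

# row_number() breaks ties arbitrarily, guaranteeing exactly 5 rows per book.
book_tfidf.withColumn('rank', func.row_number().over(w))\
    .where('rank <= 5')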

TF-IDF is a useful measure: it is a popular indicator by itself, but also a machine learning feature often used in models. The above example shows the more manual approach to calculating this measure. In practice, especially in machine learning, existing functions can be used; e.g. in Spark MLlib TF-IDF is available out of the box. Also, mind that since IDF is calculated over all the documents in the data set, you need to redo the calculation whenever the number of documents changes.
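As a minimal sketch of the MLlib route, assuming a DataFrame tokenized with a words column holding token lists (e.g. produced by pyspark.ml.feature.Tokenizer); note that MLlib's IDF uses a slightly smoothed formula, $\log\frac{c_d + 1}{i_d + 1}$:

from pyspark.ml.feature import HashingTF, IDF

# Map token lists to sparse term-frequency vectors via feature hashing.
hashing_tf = HashingTF(inputCol='words', outputCol='raw_features')
featurized = hashing_tf.transform(tokenized)

# IDF is fit on the whole data set, so it must be refit whenever
# the set of documents changes.
idf = IDF(inputCol='raw_features', outputCol='features')
idf_model = idf.fit(featurized)
tfidf_features = idf_model.transform(featurized)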

The original source notebook is available here.
