Docker for Data Science

Posted By Jakub Nowacki, 18 August 2017

Docker is a very useful tool for packaging software builds and distributing them. In fact, it is becoming the standard for application packaging, especially for web services. There are a lot of Docker images available at Docker Hub. In general, Docker is very useful for development, testing and production, but in this tutorial we'll show how to use Docker for Data Science and Apache Spark. For this, we'll get a prebuilt version of Spark with other tools bundled in the package.

First you need to get Docker; see the Docker installation documentation for your platform.

There are a number of Spark notebook images, but I have had the best experience with the one provided by the Jupyter project, namely jupyter/all-spark-notebook; see its Docker Hub page for details about the image's options.

The simplest way to download and start the image using Docker is:

docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook

I personally prefer using docker-compose; see its documentation for details. The simplest docker-compose.yml file looks as follows:

version: '3'
services:
  spark:
    image: jupyter/all-spark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/

Create this file in a folder you’d like to work in and run the command:

docker-compose up

This will fetch the image and start it. Note that Docker commands trigger the download of images from the hub, which is about 2 GB of data, so it may take some time depending on your network. The above command will start the image with a Jupyter notebook, with PySpark configured, open on port 8888 of your Docker machine:

  • on *nix it will be localhost
  • on Windows Docker Toolbox it will likely be the VM IP address; check it with the command docker-machine ip

If you start a Spark session, you can see the Spark UI on one of the ports from 4040 upwards. A session starts its UI on the next (+1) port if the current one is taken; e.g. if there is already a Spark UI on 4040, the new session will start its UI on port 4041. The volumes option will create a notebooks folder, which maps directly to the same folder on the Docker machine; thus, your files will be saved in your local folder and not in a temporary volume of the machine.
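The port-hopping behaviour can be sketched with a small helper (an illustration only; the function name is made up, and Spark handles this internally):

```python
def next_ui_port(taken_ports, start=4040, limit=4080):
    """Illustrate Spark's UI port fallback: try 4040, then 4041, ...
    until a free port within the published range is found."""
    port = start
    while port in taken_ports and port <= limit:
        port += 1
    return port

# One session already owns 4040, so a second session lands on 4041.
print(next_ui_port({4040}))  # 4041
```

This is also why the docker-compose.yml above publishes the whole 4040-4080 range rather than a single port.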

Possible issues:

  • If you have a problem with one of the above commands, first try: eval "$(docker-machine env default)" and retry the command; this will set up your environment correctly.
  • Docker (original) and Docker Toolbox cannot be installed on the machine at the same time.
  • Sometimes the VM may have problems getting an IP address on some networks, like ones with a VPN; disable the VPN and restart your machine to retry. If you still have the problem, check your network setup.
  • Sometimes Docker may have issues setting up volumes outside the user's home directory; it's best to store your files somewhere in your home folder.

But the benefits don't stop here. It is very easy to test other services. Let's say you'd like to check out MongoDB, but you are new to NoSQL databases, let alone setting them up. With Docker you just need to add a new service to docker-compose.yml as follows:

version: '3'
services:
  spark:
    image: jupyter/all-spark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
  mongo:
    image: mongo
    ports:
      - "27017:27017"

If you now restart using the docker-compose command, the MongoDB image will be pulled from the hub and set up. I won't go into MongoDB details in this post; see the MongoDB documentation for more details about the database.
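As a side note, a container's storage is ephemeral, so the database contents are lost when the container is removed. If you'd like the data to survive, one option is to bind-mount a data folder in the mongo service (a sketch; the ./mongo-data path is an arbitrary choice, while /data/db is MongoDB's default data directory):

```yaml
  mongo:
    image: mongo
    ports:
      - "27017:27017"
    volumes:
      - ./mongo-data:/data/db
```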

You can now use MongoDB in PySpark very easily with the help of its Spark connector; see the MongoDB connector documentation for details. See the example below of PySpark code which uses the connector with the MongoDB container you have just set up.

import os
import pyspark

# One way of loading additional packages to spark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins",  50),
     ("Gandalf", 1000),
     ("Thorin", 195),
     ("Balin", 178),
     ("Kili", 77),
     ("Dwalin", 169),
     ("Oin", 167),
     ("Gloin", 158),
     ("Fili", 82),
     ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])    

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load()

df.show()
+--------------------+----+-------------+
|                 _id| age|         name|
+--------------------+----+-------------+
|[59949521a7b11b00...|  50|Bilbo Baggins|
|[59949521a7b11b00...|1000|      Gandalf|
|[59949521a7b11b00...| 195|       Thorin|
|[59949521a7b11b00...| 178|        Balin|
|[59949521a7b11b00...|  77|         Kili|
|[59949521a7b11b00...| 169|       Dwalin|
|[59949521a7b11b00...| 167|          Oin|
|[59949521a7b11b00...| 158|        Gloin|
|[59949521a7b11b00...|  82|         Fili|
|[59949521a7b11b00...|null|       Bombur|
+--------------------+----+-------------+

Note that the above connector version (2.2.0) is compatible with Spark 2.2.0; if you are using a different version, consult the documentation for other versions of the connector.

Also, note how we call the MongoDB service using just its name, mongo. The networking here is taken care of by Docker; if you want to read more, see the documentation on Docker Compose networking. Moreover, the mongo port is exposed to the outside world via the ports key in docker-compose.yml, so you can also use it with other services, e.g. test processes you start on your local machine; note that in that case you need to use the Docker host address.
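To make this concrete, here is a small illustrative helper (a sketch; the function name is made up) that builds the connection URI depending on where the client runs: inside the Compose network the service name mongo resolves via Docker's DNS, while from the host machine you would use localhost (or the docker-machine IP on Docker Toolbox), since the port is published:

```python
def mongo_uri(host="mongo", port=27017, db="test"):
    """Build a MongoDB connection URI.

    Inside the Docker Compose network, the service name "mongo" resolves;
    from the host, use "localhost" (or the docker-machine IP) instead.
    """
    return "mongodb://{}:{}/{}".format(host, port, db)

print(mongo_uri())             # from inside the Spark container
print(mongo_uri("localhost"))  # from a process on your local machine
```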

Docker Hub has a lot of community-built images you can try out yourself. Note that while it may be a good idea to try some of the images out, they may not be production ready, so take this into consideration. If you want to know more, there is a lot of material describing Docker deployments; a great place to start is the Docker documentation.

