Posted By Jakub Nowacki, 18 August 2017
Docker is a very useful tool for packaging software builds and distributing them. In fact, it's becoming the standard for application packaging, especially for web services. There are a lot of Docker images available on Docker Hub. In general, Docker is very useful for development, testing and production, but in this tutorial we'll show how to use Docker for Data Science and Apache Spark. For this, we'll get a prebuilt image of Spark with other tools bundled in the package.
First you need to get Docker:
- on *nix, Windows 10 Professional and Enterprise: https://www.docker.com/community-edition#/download
- on other versions of Windows: https://www.docker.com/products/docker-toolbox
- read more about Docker on their website
There are a number of Spark notebook images, but I have had the best experience with the one provided by the Jupyter project, namely: https://hub.docker.com/r/jupyter/all-spark-notebook/; see the link for details about the image's options.
The simplest command to download and start the image using Docker is:
docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook
I personally prefer using `docker-compose`; see its documentation for details. The simplest `docker-compose.yaml` file looks as follows:
```yaml
version: '3'
services:
  spark:
    image: jupyter/all-spark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
```
Create this file in a folder you'd like to work in and run the command `docker-compose up`.
This will fetch the image and start it. Note that Docker commands trigger a download of the image from the hub, which is about 2 GB of data, so it may take some time depending on your network. The above command starts the image, which serves a Jupyter notebook with PySpark configured on port 8888 of your Docker machine:
- on *nix it will be localhost
- on Windows with Docker Toolbox it will likely be the VM IP address, e.g. 192.168.99.100; check it with the command `docker-machine ip`
If you start a Spark session you can see the Spark UI on one of the ports from 4040 upwards; a session starts its UI on the next (+1) port if the current one is taken, e.g. if there is a Spark UI on 4040, a new session will start its UI on port 4041.
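With several sessions running, it can be handy to check which of the published ports the UI actually landed on. Below is a minimal sketch that probes the 4040-4080 range with plain TCP connections; `find_spark_ui_port` is a made-up helper for illustration, not part of Spark or Docker:

```python
import socket

def find_spark_ui_port(host="localhost", start=4040, end=4080, timeout=0.2):
    """Return the first port in [start, end] that accepts a TCP
    connection, or None if nothing is listening in the range."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on a successful connection
            if s.connect_ex((host, port)) == 0:
                return port
    return None

# Prints the port of the first running Spark UI, or None if no session is up yet
print(find_spark_ui_port())
```

Note this only tells you a port is open, not that it is a Spark UI; for a container-mapped setup, use the Docker host address instead of `localhost` where appropriate.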
The volumes option creates a `notebooks` folder on your local machine that maps directly to the `/home/jovyan/work/notebooks/` folder in the container; thus, your files are saved in your local folder and not in the temporary volume of the machine.
- If you have a problem with one of the above commands, first try `eval "$(docker-machine env default)"` and retry the command; it will set up your environment correctly.
- Docker (native) and Docker Toolbox cannot be installed on the same machine at the same time.
- Sometimes the VM may have problems getting an IP address on some networks, e.g. ones with a VPN; disable the VPN and restart your machine to retry; if you still have the problem, check your network setup.
- Sometimes Docker has issues setting up volumes outside the user's home directory; it's best to keep your files somewhere in your home folder.
But the benefits don't stop here. It is very easy to test other services. Let's say you'd like to check out MongoDB, but you are new to NoSQL databases, let alone setting them up. With Docker you just need to add a new service to `docker-compose.yml` as follows:
```yaml
version: '3'
services:
  spark:
    image: jupyter/all-spark-notebook
    ports:
      - "8888:8888"
      - "4040-4080:4040-4080"
    volumes:
      - ./notebooks:/home/jovyan/work/notebooks/
  mongo:
    image: mongo
    ports:
      - "27017:27017"
```
If you now restart using the `docker-compose` command, the MongoDB image will be pulled from the hub and set up. I won't go into MongoDB details in this post; see the MongoDB documentation for more about the database.
You can now use MongoDB from PySpark very easily with the help of its Spark connector; see the MongoDB connector documentation for details. The example below shows PySpark code that uses the connector with the MongoDB container you have just set up.
```python
import os
import pyspark

# One way of loading additional packages to Spark
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195),
     ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167),
     ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
```
```
+--------------------+----+-------------+
|                 _id| age|         name|
+--------------------+----+-------------+
|[59949521a7b11b00...|  50|Bilbo Baggins|
|[59949521a7b11b00...|1000|      Gandalf|
|[59949521a7b11b00...| 195|       Thorin|
|[59949521a7b11b00...| 178|        Balin|
|[59949521a7b11b00...|  77|         Kili|
|[59949521a7b11b00...| 169|       Dwalin|
|[59949521a7b11b00...| 167|          Oin|
|[59949521a7b11b00...| 158|        Gloin|
|[59949521a7b11b00...|  82|         Fili|
|[59949521a7b11b00...|null|       Bombur|
+--------------------+----+-------------+
only showing top 20 rows
```
Note that the above connector version (2.2.0) is compatible with Spark 2.2.0; if you are using a different Spark version, consult the documentation for the matching connector version.
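To keep that version coupling in one place, you could assemble the submit arguments from the two versions instead of hard-coding the whole string. The connector's Maven coordinate has the form `groupId:artifactId_scalaVersion:version`; `mongo_connector_args` below is a hypothetical helper of mine, not part of PySpark or the connector:

```python
def mongo_connector_args(scala_version="2.11", connector_version="2.2.0"):
    """Build the PYSPARK_SUBMIT_ARGS value for the MongoDB Spark connector.

    The artifact name embeds the Scala version Spark was built with, and
    the connector version should track your Spark version (e.g. 2.2.0
    for Spark 2.2.0).
    """
    coordinate = "org.mongodb.spark:mongo-spark-connector_{}:{}".format(
        scala_version, connector_version)
    return "--packages {} pyspark-shell".format(coordinate)

# Reproduces the string used in the example above
print(mongo_connector_args())
# → --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell
```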
Also, note how we call the MongoDB service using just its name, `mongo`; the networking here is taken care of by Docker. If you want to read more, see the documentation on Docker Compose networking. Moreover, the `mongo` port is exposed to the outside world via the ports key in `docker-compose.yml`, so you can also use it with other clients, e.g. test processes you start on your local machine; note that in that case you need to use the Docker host address.
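In other words, the same database is reachable under two addresses depending on where the client runs. A small sketch of the URIs involved; `mongo_uri` is a made-up helper for illustration (the article's examples omit the port, which defaults to 27017):

```python
def mongo_uri(host, db="test", coll="coll", port=27017):
    """Build a MongoDB connection URI for a given database and collection."""
    return "mongodb://{}:{}/{}.{}".format(host, port, db, coll)

# From another service inside the compose network, the service name resolves:
print(mongo_uri("mongo"))      # mongodb://mongo:27017/test.coll

# From the host, use the Docker host address and the published port instead,
# e.g. localhost on *nix or the VM IP address on Docker Toolbox:
print(mongo_uri("localhost"))  # mongodb://localhost:27017/test.coll
```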
Docker Hub has a lot of community-built images you can try out yourself. Note that while it may be good idea to try some of the images out, they may not be production ready, so you need to take this into consideration. If you want to know more, there is a lot of materials describing Docker deployments; great place to start is Docker documentation.