PySpark and Jupyter: Quick Local Setup with Docker

outcastgeek

How to quickly set up a local PySpark node running with Jupyter, using Docker…

Aspiring data scientists and data analysts looking to get started quickly with PySpark and Jupyter: here is a quick write-up showing you how to spin up a local workspace using Docker.

First, make sure you have Docker, docker-machine, and docker-compose installed on your machine.

  1. Create a new Docker machine:

    In your terminal, run the following commands:

    cd /to/your/workspace
    mkdir learning_pyspark && cd learning_pyspark
    mkdir -p code data notebooks
    docker-machine create -d virtualbox SciMachine
    eval `docker-machine env SciMachine`
  2. Create your Docker configuration files and scripts:

    In your learning_pyspark folder:

    touch Dockerfile
    touch docker-compose.yml

    The contents of the Dockerfile and docker-compose.yml are below:

    Dockerfile

    # Build on top of the Jupyter project's pyspark-notebook image
    FROM jupyter/pyspark-notebook

    MAINTAINER outcastgeek <outcastgeek+docker@gmail.com>

    # The whole project folder is mounted at /workspace by docker-compose
    WORKDIR /workspace/notebooks

    # Launch Jupyter through the custom startup script, served under /workspace
    CMD ["/workspace/start-notebook.sh", "--NotebookApp.base_url=/workspace"]

    docker-compose.yml

    learning_pyspark:
      build: .            # build the image from the Dockerfile above
      restart: always
      ports:
        - "4040:4040"     # Spark UI
        - "8888:8888"     # Jupyter notebook server
      volumes:
        - .:/workspace    # mount this folder (code, data, notebooks) into the container
  3. Create your startup script:

    Still inside your learning_pyspark folder:

     touch start-notebook.sh
     chmod +x start-notebook.sh  # the container runs this script directly, so it must be executable

    Again, the content of start-notebook.sh is below:

    start-notebook.sh

    #!/bin/bash
    
    # Change UID of NB_USER to NB_UID if it does not match
    if [ "$NB_UID" != $(id -u $NB_USER) ] ; then
        usermod -u $NB_UID $NB_USER
        chown -R $NB_UID $CONDA_DIR
    fi
    
    # Enable sudo if requested
    if [ ! -z "$GRANT_SUDO" ]; then
        echo "$NB_USER ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/notebook
    fi
    
    # Start the notebook server
    exec su $NB_USER -c "env PATH=$PATH jupyter notebook $*"
  4. Run your environment:

    From within your learning_pyspark folder, run your container:

     docker-compose up

    Obtain the IP address of your Docker machine (SciMachine):

    docker-machine ip SciMachine
        
  5. Now get to work:

    Your Jupyter workspace is available here: http://${SciMachine IP Address}:8888/workspace

    Create a notebook and run some PySpark workload in it (for example, the short snippet below); the Spark UI will then be available here: http://${SciMachine IP Address}:4040
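
    As a quick smoke test, a cell like the one below should work in a new Python notebook. This is only a minimal sketch: it assumes the pyspark package that ships with the jupyter/pyspark-notebook image is importable as-is, the application name is arbitrary, and the data file path is hypothetical.

    import pyspark

    # Start a local Spark context; this is what brings up the Spark UI on port 4040.
    sc = pyspark.SparkContext("local[*]", "learning_pyspark")

    # A tiny workload: sum the squares of 0..999.
    rdd = sc.parallelize(range(1000))
    print(rdd.map(lambda x: x * x).sum())

    # Files dropped into ./data on the host are visible inside the container,
    # e.g. sc.textFile("/workspace/data/some_file.txt") -- hypothetical file name.

    sc.stop()  # stops the context (and the Spark UI) when you are done

    Once the cell has run, the completed job should be listed in the Spark UI mentioned above.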

Feel free to clone https://github.com/outcastgeek/docker_pyspark.git and play around.

Any questions, feedback, or comments?