We have been getting some good feedback recently on the radanalytics.io project, and one topic that has been on my mind for some time now is support for Python 3. As I hope you know, Python 2 is moving ever closer to its end of life. I think most Python fans see this as a good thing (TM), even if getting all that Python 2 legacy code up to date is no small task.
To that end, we are working to enable Python 3 as a first-class citizen in our Apache Spark clusters and source-to-image builders. I have created a few experimental images to demonstrate this capability, and I am sharing here a workflow for all the eager alpha testers out there.
Please note, this work is evolving and these instructions are ALPHA version!!!
The sources
For starters, there are two image sources that I have assembled to support the Python 3 work: an image for the cluster, and an image for the builder.
The cluster image lives at docker.io/elmiko/openshift-spark:python36-latest, with the source residing at github.com/elmiko/openshift-spark/tree/python36.
The builder image lives at docker.io/elmiko/radanalytics-pyspark:python36-latest, with the source residing at github.com/elmiko/oshinko-s2i/tree/python36.
Both of these repositories use the concreate tool for building the images. I won't discuss how to do that here, but take a look at the Makefile and related scripts in each repo for more information.
Using the images
Now for the fun part: how to start using these images to create Python 3 Apache Spark applications on OpenShift.
For now, I have only tested the workflow where a user deploys a Spark cluster manually. I will be adding support for automatic clusters, but that will take a little longer to implement and test.
Step 1. Launch a cluster
To launch a cluster with my custom image, I will use the oshinko CLI tool. You can find binary releases here.
oshinko create spy3 --image=elmiko/openshift-spark:python36-latest --masters=1 --workers=1
This will create a cluster named spy3 with 1 master and 1 worker.
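If you want a quick sanity check before moving on, you can watch for the master and worker pods, or ask the oshinko CLI about the cluster. The spy3-m naming shows up again later in this post; spy3-w as the matching worker name is my assumption from the usual oshinko pattern:
oc get pods         # look for running spy3-m-... and spy3-w-... pods
oshinko get spy3    # should report the cluster with its master and worker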
Step 2. Set up the builder template for my application
Next I want to launch an application against the cluster. For this I will use our very basic Pi tutorial code.
To make this work properly though, I need to modify the source-to-image template that I use for building the application. The following template is what I want:
apiVersion: v1
kind: Template
labels:
  application: oshinko-python-spark
  createdBy: template-oshinko-python36-spark-build-dc
metadata:
  annotations:
    description: Create a buildconfig, imagestream and deploymentconfig using source-to-image and Python Spark source files hosted in git
    openshift.io/display-name: Apache Spark Python
  name: oshinko-python36-spark-build-dc
objects:
- apiVersion: v1
  kind: ImageStream
  metadata:
    name: ${APPLICATION_NAME}
    labels:
      app: ${APPLICATION_NAME}
  spec:
    dockerImageRepository: ${APPLICATION_NAME}
    tags:
    - name: latest
- apiVersion: v1
  kind: BuildConfig
  metadata:
    name: ${APPLICATION_NAME}
    labels:
      app: ${APPLICATION_NAME}
  spec:
    output:
      to:
        kind: ImageStreamTag
        name: ${APPLICATION_NAME}:latest
    source:
      contextDir: ${CONTEXT_DIR}
      git:
        ref: ${GIT_REF}
        uri: ${GIT_URI}
      type: Git
    strategy:
      sourceStrategy:
        env:
        - name: APP_FILE
          value: ${APP_FILE}
        forcePull: true
        from:
          kind: DockerImage
          name: elmiko/radanalytics-pyspark:python36-latest
      type: Source
    triggers:
    - imageChange: {}
      type: ImageChange
    - type: ConfigChange
    - github:
        secret: ${APPLICATION_NAME}
      type: GitHub
    - generic:
        secret: ${APPLICATION_NAME}
      type: Generic
- apiVersion: v1
  kind: DeploymentConfig
  metadata:
    name: ${APPLICATION_NAME}
    labels:
      deploymentConfig: ${APPLICATION_NAME}
      app: ${APPLICATION_NAME}
  spec:
    replicas: 1
    selector:
      deploymentConfig: ${APPLICATION_NAME}
    strategy:
      type: Rolling
    template:
      metadata:
        labels:
          deploymentConfig: ${APPLICATION_NAME}
          app: ${APPLICATION_NAME}
      spec:
        containers:
        - env:
          - name: DRIVER_HOST
            value: ${APPLICATION_NAME}-headless
          - name: OSHINKO_CLUSTER_NAME
            value: ${OSHINKO_CLUSTER_NAME}
          - name: APP_ARGS
            value: ${APP_ARGS}
          - name: SPARK_OPTIONS
            value: ${SPARK_OPTIONS}
          - name: OSHINKO_DEL_CLUSTER
            value: ${OSHINKO_DEL_CLUSTER}
          - name: APP_EXIT
            value: "true"
          - name: OSHINKO_NAMED_CONFIG
            value: ${OSHINKO_NAMED_CONFIG}
          - name: OSHINKO_SPARK_DRIVER_CONFIG
            value: ${OSHINKO_SPARK_DRIVER_CONFIG}
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          image: ${APPLICATION_NAME}
          imagePullPolicy: IfNotPresent
          name: ${APPLICATION_NAME}
          resources: {}
          terminationMessagePath: /dev/termination-log
          volumeMounts:
          - mountPath: /etc/podinfo
            name: podinfo
            readOnly: false
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        serviceAccount: oshinko
        volumes:
        - downwardAPI:
            items:
            - fieldRef:
                fieldPath: metadata.labels
              path: labels
          name: podinfo
    triggers:
    - imageChangeParams:
        automatic: true
        containerNames:
        - ${APPLICATION_NAME}
        from:
          kind: ImageStreamTag
          name: ${APPLICATION_NAME}:latest
      type: ImageChange
    - type: ConfigChange
- apiVersion: v1
  kind: Service
  metadata:
    name: ${APPLICATION_NAME}
    labels:
      app: ${APPLICATION_NAME}
  spec:
    ports:
    - name: 8080-tcp
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      deploymentConfig: ${APPLICATION_NAME}
- apiVersion: v1
  kind: Service
  metadata:
    name: ${APPLICATION_NAME}-headless
    labels:
      app: ${APPLICATION_NAME}
  spec:
    clusterIP: None
    ports:
    - name: driver-rpc-port
      port: 7078
      protocol: TCP
      targetPort: 7078
    - name: blockmanager
      port: 7079
      protocol: TCP
      targetPort: 7079
    selector:
      deploymentConfig: ${APPLICATION_NAME}
parameters:
- description: 'The name to use for the buildconfig, imagestream and deployment components'
  from: 'python-spark-[a-z0-9]{4}'
  generate: expression
  name: APPLICATION_NAME
  required: true
- description: The URL of the repository with your application source code
  displayName: Git Repository URL
  name: GIT_URI
- description: Optional branch, tag or commit
  displayName: Git Reference
  name: GIT_REF
- description: Git sub-directory path
  name: CONTEXT_DIR
- description: The name of the main py file to run. If this is not specified and there is a single py file at top level of the git repository, that file will be chosen.
  name: APP_FILE
- description: Command line arguments to pass to the Spark application
  name: APP_ARGS
- description: List of additional Spark options to pass to spark-submit (for example --conf property=value --conf property=value). Note, --master and --class are set by the launcher and should not be set here
  name: SPARK_OPTIONS
- description: The name of the Spark cluster to run against. The cluster will be created if it does not exist, and a random cluster name will be chosen if this value is left blank.
  name: OSHINKO_CLUSTER_NAME
- description: The name of a stored cluster configuration to use if a cluster is created, default is 'default'.
  name: OSHINKO_NAMED_CONFIG
- description: The name of a configmap to use for the Spark configuration of the driver. If this configmap is empty the default Spark configuration will be used.
  name: OSHINKO_SPARK_DRIVER_CONFIG
- description: If a cluster is created on-demand, delete the cluster when the application finishes if this option is set to 'true'
  name: OSHINKO_DEL_CLUSTER
  required: true
  value: 'true'
You can deploy this template quickly by using this command:
oc create -f https://gist.githubusercontent.com/elmiko/64338bcf36bdbc19de63330dafd5c706/raw/4f01dbcfa90e89f9567fb67ed8128e37dfc2d476/oshinko-python36-spark-build-dc.yaml
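If you want to confirm the template landed in your project, something like this should show it:
oc get template oshinko-python36-spark-build-dc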
Step 3. Launch the application
With everything in place, I am now ready to launch my application. I will use the previously created Spark cluster and my custom template. The following command will build and deploy my sparkpi application:
oc new-app --template oshinko-python36-spark-build-dc \
-p APPLICATION_NAME=sparkpi \
-p GIT_URI=https://github.com/elmiko/tutorial-sparkpi-python-flask.git \
-p GIT_REF=python3 \
-p OSHINKO_CLUSTER_NAME=spy3
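The first build can take a little while since the builder image gets pulled fresh (the template sets forcePull). If you want to follow along, these optional commands should let you watch the build and the driver pod:
oc logs -f bc/sparkpi    # stream the source-to-image build log
oc get pods -w           # watch for the sparkpi driver pod to reach Running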
I also need to expose a route to my app:
oc expose svc/sparkpi
With all this in place, I can now make a curl request to my application and confirm an approximation of Pi:
$ curl http://`oc get routes/sparkpi --template='{{.spec.host}}'`/sparkpi
Pi is roughly 3.14388
Confirming Python 3 in the images
Much of this might look like magic. If you really want to confirm that Python 3 is in the images and is actually being used, there are a couple of options. You can inject some code into your application that prints the Python version, for example:
import sysconfig
print(sysconfig.get_python_version())
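If you also want proof from the cluster side, a slightly bigger sketch like this should work inside your application; it assumes a SparkContext is available, which any PySpark app computing Pi will already have:
import sysconfig
import pyspark
# Report the driver's Python version, then run a tiny one-partition job
# so an executor reports its version as well.
sc = pyspark.SparkContext.getOrCreate()
print("driver python:", sysconfig.get_python_version())
print("executor python:",
      sc.parallelize([0], 1).map(lambda _: sysconfig.get_python_version()).first())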
On the driver, you can use the terminal option in OpenShift to log in and run the Python REPL. You should see something like this:
$ oc rsh dc/sparkpi
(app-root) sh-4.2$ python
Python 3.6.3 (default, Mar 20 2018, 13:50:41)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
For the Spark master and worker nodes it's a little more complicated, as they use the Software Collections enablement command to invoke the Python REPL, as follows:
$ oc rsh dc/spy3-m
sh-4.2$ scl enable rh-python36 python
Python 3.6.3 (default, Mar 20 2018, 13:50:41)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Hopefully this will be enough to satisfy your curiosity. =)
Bonus: Jupyter notebook connecting to an Oshinko cluster with Python 3
As another small experiment, I have been working to get a Jupyter notebook connected to a Python 3 Spark cluster spawned by Oshinko.
I have created a custom image to help this work proceed. You can find the image at docker.io/elmiko/jupyter-notebook-py36, with the corresponding code at github.com/elmiko/jupyter-notebook-py36.
To make this work, you will need to deploy a cluster as indicated above, and then craft your initial notebook cells to utilize the cluster. This is all quite rough currently, but I expect the tooling will become smoother over time.
Step 1. Launch the notebook
Launch the notebook image by running the following command:
oc new-app elmiko/jupyter-notebook-py36 \
-e JUPYTER_NOTEBOOK_PASSWORD=foo \
-e PYSPARK_PYTHON=/opt/rh/rh-python36/root/usr/bin/python
Then expose a route to your notebook:
oc expose svc/jupyter-notebook-py36
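The route name should match the service name, so you can grab the notebook URL with the same oc get routes trick used for the sparkpi app above, then log in with the password set by JUPYTER_NOTEBOOK_PASSWORD:
oc get routes/jupyter-notebook-py36 --template='{{.spec.host}}'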
Step 2. Attach to a running cluster
To attach the notebook to a running Spark cluster, you need to do a little setup on your Spark context. The following code should be used in your first cell to set up this interaction; note that these values are highly specific to this image and deployment:
import pyspark
# Point the context at the cluster's master service, and make the driver
# reachable from the executors via the notebook service name and fixed ports.
conf = pyspark.SparkConf().setMaster('spark://spy3:7077') \
    .set('spark.driver.host', 'jupyter-notebook-py36') \
    .set('spark.driver.port', 42000) \
    .set('spark.driver.bindAddress', '0.0.0.0') \
    .set('spark.driver.blockManager.port', 42100)
sc = pyspark.SparkContext(conf=conf)
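To convince yourself the notebook is really talking to the cluster, a tiny job in the next cell makes a reasonable smoke test (the numbers are arbitrary, anything small will do):
# Run a trivial job on the attached cluster; the work is done by the
# executors rather than locally in the notebook.
rdd = sc.parallelize(range(1000))
print(rdd.sum())    # 499500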
Closing thoughts
This work is all evolving at a quick rate, but I sincerely hope we will start to land Python 3 support next week with more enhancements to follow. The radanalytics.io project is all about making it easier for developers to do machine learning and analytics work on OpenShift. Hopefully, these changes will move us forward in that direction.
As always, happy hacking!