A Docker image packages an application's code, together with its dependencies, so it can run on any platform that supports Docker. This lets programmers run different versions of the same software side by side, regardless of how each host machine is configured.
If you're working on a project that spans many operating systems, setting up Python data science libraries on each of them can be a nightmare. If you're having trouble installing Python data science libraries, you've come to the right place.
You’ll learn how to create a Docker image for Python data science libraries using not one, but two techniques in this article.
Ready? Let’s get started!
Prerequisites
This tutorial will be a hands-on demonstration. If you'd like to follow along, be sure you have the following in place:
- Docker Desktop or Docker Engine. This tutorial uses version 20.10.8.
- A Windows 10 computer. This tutorial uses Windows 10 OS Build 19042.1165, but other Docker-compatible operating systems will work too.
Working with Jupyter Notebook Setup
One way to get Docker images for your Python data science libraries is to use Jupyter Notebook base images. The official Jupyter Project Docker images on Docker Hub save you time by installing many libraries at once.
Jupyter Notebook itself is a web application that lets you create and share documents containing live code, such as Python code; the Jupyter Notebook Docker images package it ready to run.
1. Run the docker run command below in PowerShell as administrator to build a running container of Jupyter Notebook's base image, all-spark-notebook, on your host machine.
In this example, the all-spark-notebook image is tagged with latest, and the container is named ata_datasci_docker.
Choose the right Jupyter Notebook image for your projects by following Jupyter’s instructions.
```shell
docker run --name ata_datasci_docker -p 8888:8888 jupyter/all-spark-notebook:latest
```
Because this is the first time Docker is pulling the image from Jupyter, the download takes a while.
Downloading the all-spark-notebook Docker image
2. To visit JupyterLab in your web browser from the container, press Ctrl and click the final URL, or copy and paste the URL that starts with 127.0.0.1.
In this example, localhost:8888 on the host machine maps to a token-secured notebook server in the container.
Running the Jupyter Docker image container and connecting to localhost:8888
3. Open your favorite web browser and go to the copied URL to see a fresh installation of the Jupyter server along with all of the essential Python data science libraries.
As illustrated below, click the New button and pick Python 3 (ipykernel). This opens a new tab with an untitled Python 3 notebook, which you'll use in the following step.
Creating a Jupyter Notebook from Scratch
4. Paste the commands below into the first cell of the new Python 3 Jupyter notebook (In [1]). Press Shift+Enter to run the commands and import the libraries into the Jupyter notebook.
```python
import pandas
import numpy
import matplotlib
```
numpy, pandas, and matplotlib are the most popular data science libraries in Python. You may need more data science libraries for your use case, but these three will be enough for most small-scale data science projects. Check out sites such as Calm Code to see which Python libraries are appropriate for your project.
Using a Jupyter notebook to import Python libraries
5. Paste the following commands into the input cell of the Python 3 notebook (In [#]), then press Shift+Enter to run them. Doing so lets you check that each library you imported is working.
a. Putting the numpy Library to the Test
Create a numeric array with the commands below, then let numpy compute and report the greatest value in the numpy_test array.
```python
# Make a numpy array
numpy_test = numpy.array([9,1,2,3,6])

# See whether numpy calculates the array's maximum value:
# numpy.max prints the largest value in the numpy_test array
numpy.max(numpy_test)
```
Using a Jupyter notebook to test the numpy library
b. Putting the pandas Library to the Test
Create and print an example dataframe, a two-dimensional data structure in table format, with two columns, name and age, using the commands below.
```python
# Make a pandas dataframe with name and age columns
sample = pandas.DataFrame(columns=['name','age'])

# Add rows: HelenMary, who is 18 years old, and so on
sample.loc[1] = ['HelenMary', 18]
sample.loc[2] = ['Adam', 44]
sample.loc[3] = ['Arman', 25]

# Show the generated dataframe
sample
```
Using a Jupyter notebook to test the pandas library
c. Putting the matplotlib Library to the Test
The commands below generate a bar chart from the sample dataframe you created while testing the pandas library.
pandas and numpy let you perform arithmetic computations and data manipulations on raw data, while matplotlib lets you visualize the results.
```python
# Import matplotlib's pyplot module
import matplotlib.pyplot as plt

# Create a bar chart using name as the label and age as the value from the previous steps
plt.bar(sample['name'], sample['age'])
```
The visual representation of the dataframe in chart form is shown below.
Testing the matplotlib library
You only tried three libraries here, but depending on the image you choose for your data science project, Jupyter's Docker images include many more Python libraries.
For example, the all-spark-notebook image used in this tutorial can run large-scale data processing tasks with Apache Spark, as sketched below. If you don't need quite as much computational power, a Jupyter Docker image like minimal-notebook would suffice.
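To give you a sense of that capability, here is a minimal sketch of a Spark smoke test you could run inside the all-spark-notebook container, assuming the pyspark package the image bundles (the app name and sample rows are illustrative):

```python
# Start (or reuse) a local SparkSession inside the container
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ata-spark-test").getOrCreate()

# Build a small Spark DataFrame and display it
df = spark.createDataFrame(
    [("HelenMary", 18), ("Adam", 44), ("Arman", 25)],
    ["name", "age"],
)
df.show()

# Stop the session when finished
spark.stop()
```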
Working with a Minimal Setup Using Slim Python Images
The previous approach of using Jupyter's Docker images is handy for installing bundles of Python data science libraries that work together. However, Jupyter's containers can become overly large or overloaded with functionality.
You might prefer a simpler setup for your data science project. In that case, consider Python's official Docker images, which give you greater control, emphasize performance, and are still easy to use. Official Docker images also include all of Python's most recent updates.
1. Use the command below to create an empty Dockerfile in the current directory. You'll need this Dockerfile to build a lightweight Linux container powered by Python's official image from Docker Hub.
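For example, in PowerShell, the New-Item cmdlet creates the empty file:

```powershell
# Create an empty file named Dockerfile in the current directory
New-Item Dockerfile
```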
This demonstration uses the 3.9.7-slim-bullseye version, but more Python versions are available on the official Python Docker Hub page. Choose one based on your use case and Python version preference.
On Linux-based operating systems, you can do the same thing with touch Dockerfile.
2. Next, open the Dockerfile in your preferred text editor and copy/paste the code below into it. In the code below, change the maintainer value to your name and adjust the description to your liking.
This code performs a few tasks:
- Pulls the Python 3.9.7 slim-bullseye image specifically.
- Uses LABEL commands to add descriptions to the image, which show up on Docker Hub.
- Sets the working directory to use once the Docker container starts.
- Installs the Python libraries nbterm, numpy, matplotlib, seaborn, and pandas.
```dockerfile
# Pull the python:3.9.7-slim-bullseye image
FROM python:3.9.7-slim-bullseye

# Image descriptions
LABEL maintainer="Adam the Automator"
LABEL version="0.1"
LABEL description="data science environment base"

# Specify the working directory
WORKDIR /data

# Install the Python data science libraries
RUN pip install nbterm numpy matplotlib seaborn pandas
```
3. Navigate to the working directory where you saved your Dockerfile. Then run the docker build command below to create your own data science image, ds_slim_env, from the Dockerfile in your working directory (.).
For this demonstration, the image is called ds_slim_env, but you may name it anything you like.

```shell
docker build -t ds_slim_env .
```
Installing basic Python data science packages and creating an image
4. Now run the docker image ls command below to list all Docker images on your machine and verify that the ds_slim_env image exists.

```shell
docker image ls
```
Comparing the Python slim and Jupyter data science Docker images
5. To use the data science environment (ds_slim_env), run the command below to start an interactive (-it) container named minimal_env. As you'll see in the following step, the command drops you into the Docker container's Linux shell terminal (/bin/bash).
```shell
docker run -it --name minimal_env ds_slim_env /bin/bash
```
6. Next, run the command below to check which Python version is installed. Take note of the version, since it's useful when installing libraries later.
```shell
python --version
```
Verifying the Python Version
7. Run the commands below to install a Python kernel and open nbterm, the command-line equivalent of the Jupyter notebook. Installing a Python kernel lets you test libraries.
```shell
# Install a Python kernel
pip install ipykernel

# Open nbterm (the command-line equivalent of the Jupyter notebook)
nbterm
```
As you can see below, the command takes you to an interface that looks like a Jupyter notebook, but without the weight of a browser-based server.
Using Jupyter Notebook’s command-line interface (nbterm)
8. Type the commands below into the cell (In [1]), as shown. Press Esc, then Ctrl+E to run the commands in the cell. These commands import the pandas, numpy, and seaborn libraries.
```python
import pandas
import numpy
import seaborn
```
Using nbterm to import libraries
9. Press Esc, then B to open a new cell and type the commands shown below. Press Esc, then Ctrl+E to run the commands in the cell, as you did before (step eight).
Create and print an example dataframe in table format with two columns (name and age) using the commands below.
```python
sample = pandas.DataFrame(columns=['name','age'])
sample.loc[1] = ['HelenMary', 18]
sample.loc[2] = ['Adam', 44]
sample.loc[3] = ['Arman', 25]
sample
```
You can also perform additional tasks, such as writing for loops and manipulating datasets, to gain further insight; a sketch follows the screenshot below.
The example dataframe appears appropriately in a tabular format, as seen below.
Using nbterm to test Pandas Library
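As one hypothetical example of those follow-on tasks, the sketch below loops over the sample dataframe and adds a derived column (the age_in_months column is purely illustrative):

```python
# Loop over the rows of the sample dataframe from step nine
for index, row in sample.iterrows():
    print(f"{row['name']} is {row['age']} years old")

# Manipulate the dataset: add a derived column computed from age
sample['age_in_months'] = sample['age'] * 12
sample
```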
You can also try the additional libraries from the "Working with Jupyter Notebook Setup" section, as you did in step five.
You can also download flat files containing data, such as CSV files, and turn them into a pandas dataframe before transforming and visualizing the information with the various libraries, as sketched below.
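A minimal sketch of that workflow, assuming a hypothetical people.csv file in the container's working directory:

```python
import pandas

# Read the CSV file into a pandas dataframe (people.csv is an example file)
df = pandas.read_csv('people.csv')

# Preview the first five rows before transforming or visualizing the data
df.head()
```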
To quit nbterm and return to the container's terminal shell, press Esc, then Ctrl+Q twice. To return to the host machine's regular Windows 10 PowerShell command line, type exit.
Conclusion
In this tutorial, you learned two approaches for creating a Docker image for Python data science libraries. The first uses pre-existing Jupyter Docker images, while the second builds a minimal, lightweight Docker image from Python's official image repository.
With both techniques, you learned how to install a few data science libraries and even build your own Docker image with a Dockerfile. As you can see, setting up a data science environment with Docker takes a bit of effort, but it's well worth it!
Why not try deploying a data-driven website Docker image to AWS resources as a next step?
Frequently Asked Questions
How do I create a Docker image for Python?
A: There are a few ways to create Docker images for your Python applications. The approach this article recommends is to write a Dockerfile that starts from an official Python base image, installs your dependencies, and copies in your code, then build it with docker build.
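A minimal sketch of such a Dockerfile, assuming a hypothetical app.py and requirements.txt in your project directory:

```dockerfile
# Start from an official slim Python base image
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Install dependencies first so this layer caches well
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the rest of the application code
COPY . .

# Run the application when the container starts
CMD ["python", "app.py"]
```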
Is Docker used in data science?
A: Yes. Docker is containerization software that lets applications run and be managed in lightweight, isolated environments, which means data science environments can be easily reproduced and deployed across multiple operating systems.
Can Docker be used with Python?
A: Yes, you can use Docker with Python. Exactly what's involved depends on what you want the system to do, but most likely you can't just run any script that exists within a Docker container without some modifications.