Dockerized Selenium: Integrating Docker For Python Selenium Scripts.

Pragnakalp Techlabs · Published in Level Up Coding · 6 min read · Jan 25, 2024

Introduction

In this blog, we explore how Selenium and Docker fit together. Running Selenium tests inside Docker has become a popular choice among developers because of its simplicity and consistency: Docker lets you bundle your Python Selenium script and all of its dependencies into a single container, making it easy to run your tests reliably across different setups.

Let’s briefly dive into the basics of using Docker to run your Selenium tests, showcasing how this approach can make your testing process more straightforward and reliable.

Prerequisites

1. Docker installation

Ensure Docker is installed on your machine.

Reference link: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-20-04

2. Selenium installation using Python

The latest Selenium release (4.16.2 at the time of writing) may not work with this Dockerfile, so pin version 4.0.0 in your Dockerfile:

RUN pip install selenium==4.0.0

3. Chrome setup in Docker

At the time of writing, the latest available Chrome release is 120.0.6099.109, but the latest stable ChromeDriver release is 114.0.5735.90. Therefore, we will use Chrome version 114 along with ChromeDriver version 114.0.5735.90.

Download the Chrome 114 .deb package from the reference link provided and add it to the current working directory.

4. Chromedriver setup in Docker

The latest stable ChromeDriver release is 114.0.5735.90, and we will use this version. Note that the ChromeDriver version must match the Chrome browser version: for example, if the Chrome browser is version 114, then the ChromeDriver must also be version 114.

RUN wget https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_linux64.zip
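
Before building the image, you can sanity-check that the browser and driver agree on a major version. The following is a minimal sketch, assuming google-chrome and chromedriver are already installed and on the PATH:

import subprocess

def major_version(cmd):
    # e.g. "Google Chrome 114.0.5735.90" -> "114"
    output = subprocess.check_output(cmd, text=True)
    for token in output.split():
        if token[0].isdigit():
            return token.split('.')[0]
    return None

chrome = major_version(["google-chrome", "--version"])
driver = major_version(["chromedriver", "--version"])
print(f"Chrome major: {chrome}, ChromeDriver major: {driver}")
assert chrome == driver, "Chrome and ChromeDriver major versions must match"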

Selenium script using Python

Create a Python file with the name app.py.

Step-1 Installation of required libraries

Install the libraries required by our Selenium script on your system using the following commands.

pip install bs4==0.0.1
pip install selenium==4.0.0

Step-2 Import libraries and functions

import selenium
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup
import os
  • We have imported the necessary libraries and functions required for our Selenium script.

Step-3 Configure variables

search_query = os.environ.get('USER_INPUT', 'image_name')
total_img = int(os.environ.get('TOTAL_IMAGES', '10'))  # default to 10 images if unset
image_src_list = []
  • The ‘search_query’ variable reads the user’s query from an environment variable, while ‘total_img’ holds the requested number of images, read from another environment variable (see the defensive parsing sketch below). An empty list, ‘image_src_list’, is initialized to store the URLs of the fetched images. Since we’re scraping Google’s image results page, only a limited number of image URLs are available from a single page load.
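
Note that int() raises a ValueError if TOTAL_IMAGES is set to a non-numeric value. A slightly more defensive variant, shown here as a minimal sketch with an assumed fallback of 10 images, avoids crashing on bad input:

import os

def read_total_images(default=10):
    # Fall back to a sane default instead of crashing on a missing or non-numeric value
    try:
        return int(os.environ.get('TOTAL_IMAGES', ''))
    except ValueError:
        return default

total_img = read_total_images()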

Step-4 Setting up Selenium driver

def search():
    options = webdriver.ChromeOptions()
    options.add_argument("--no-sandbox")
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    driverPath = "/usr/bin/chromedriver"
    # Selenium 4 style: pass the driver path through a Service object
    driver = webdriver.Chrome(service=ChromeService(executable_path=driverPath), options=options)
  • We’ve implemented a ‘search’ function that uses the ChromeDriver binary located at /usr/bin/chromedriver together with a few Chrome options. ‘--headless’ lets Chrome run without a visible interface, ‘--disable-gpu’ turns off GPU acceleration, and ‘--no-sandbox’ disables Chrome’s sandbox, which is required when Chrome runs as root inside a container.

Step-5 Extracting and saving image URLs

    url = 'https://www.google.com/'
    driver.get(url)
    search_bar = driver.find_element(By.NAME, 'q')
    search_bar.clear()
    search_bar.send_keys(search_query)
    search_bar.submit()
    time.sleep(2)
    images_link = driver.find_element(By.LINK_TEXT, 'Images')
    images_link.click()
    time.sleep(3)

    # Save the rendered page source and parse it with BeautifulSoup
    content = driver.page_source
    with open('content.html', 'w') as html_file:
        html_file.write(content)
    soup = BeautifulSoup(content, 'html.parser')
    images = soup.find_all('img', class_="rg_i Q4LuWd")
    os.makedirs(search_query.replace(' ', '_'), exist_ok=True)

    # Collect image URLs ('src', or 'data-src' for lazy-loaded thumbnails)
    for img in images:
        src = img.get('src') or img.get('data-src')
        if src and src.startswith(('http', 'data')):
            image_src_list.append(src)
        if len(image_src_list) == total_img:
            break

    filename = f"{search_query.replace(' ', '_')}/urls.txt"
    with open(filename, "w") as f:
        for src in image_src_list[:total_img]:
            f.write(src + '\n')
    driver.quit()  # close the browser session

search()
  • The provided Python code automates a Google Images search for the specified query using Selenium. The script submits the query on Google’s home page, opens the Images tab, captures the page source, and parses it with BeautifulSoup. Image URLs are extracted from the matching img tags, limited to ‘total_img’ entries, and written to a urls.txt file inside a directory named after the search query. The whole process is wrapped in a function, making it modular and easy to integrate into larger projects. Finally, we call the function.
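
The script stores only the URLs. If you also want to download the image files themselves, a minimal follow-up sketch is shown below; it assumes the urls.txt layout produced above, fetches http(s) URLs with urllib (some hosts may reject requests without a User-Agent header), and decodes inline data: URLs with base64:

import base64
import os
import urllib.request

def download_images(query):
    folder = query.replace(' ', '_')
    with open(os.path.join(folder, 'urls.txt')) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, src in enumerate(urls):
        # The .jpg extension is a guess; Google thumbnails are typically JPEG
        path = os.path.join(folder, f"img_{i}.jpg")
        if src.startswith('data:'):
            # data: URLs embed the image bytes directly after the comma
            header, encoded = src.split(',', 1)
            with open(path, 'wb') as out:
                out.write(base64.b64decode(encoded))
        else:
            urllib.request.urlretrieve(src, path)

download_images(search_query)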

Create a requirements.txt

Generate a requirements.txt file for your environment by executing the below command.

pip freeze > requirements.txt

It will make a list of your libraries with their versions. This keeps your project’s environment consistent and easy to set up.
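
If you only installed the packages listed earlier, the generated file should contain at least the two pins below (your file may also list transitive dependencies such as beautifulsoup4, with versions specific to your machine):

bs4==0.0.1
selenium==4.0.0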

Create a Dockerfile

# Use the Python 3.8 base image (Debian-based, so apt is available)
FROM python:3.8

# Set an environment variable for the search query (placeholder; override with -e at runtime)
ENV USER_INPUT image_name

# Set an environment variable for the total number of images (override with -e at runtime)
ENV TOTAL_IMAGES 10

# Update package lists and install the 'unzip' package
RUN apt-get update && apt-get install -y unzip

# Copy the Chrome Debian package to the image
COPY chrome_114_amd64.deb ./

# Install the Chrome Debian package
RUN apt install -y ./chrome_114_amd64.deb

# Download ChromeDriver binary version 114.0.5735.90 for Linux
RUN wget https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_linux64.zip

# Unzip the downloaded ChromeDriver binary
RUN unzip chromedriver_linux64.zip

# Move the ChromeDriver binary to /usr/bin
RUN mv chromedriver /usr/bin/chromedriver

# Print the version of Google Chrome installed
RUN google-chrome --version

# Set the working directory inside the image to /app
WORKDIR /app

# Install Selenium version 4.0.0 using pip
RUN pip install selenium==4.0.0

# Copy the requirements.txt file to /app
COPY requirements.txt /app/requirements.txt

# Install Python dependencies listed in requirements.txt
RUN pip install -r /app/requirements.txt

# Copy the Python script 'app.py' to /app
COPY app.py /app/

# Declare a volume at the specified path for persistent data storage
VOLUME <Your working directory path>
# (e.g. VOLUME /media/project/Test)

# Specify the default command to execute when the container starts
ENTRYPOINT ["python", "app.py"]
  • Create a file named Dockerfile (without any extension) and insert the content above into it. This Dockerfile builds an image from the Debian-based python:3.8 base image. It updates the package lists, installs unzip, installs Chrome 114 from the local .deb package along with ChromeDriver 114.0.5735.90, sets the working directory to /app, installs Selenium 4.0.0, and copies in and installs the remaining Python dependencies from requirements.txt. The app.py script is copied into the /app directory, and the entry point is configured to execute that script with the Python interpreter. Additionally, a volume is declared (for example /media/project/Test), allowing data to persist and be shared between the Docker container and the host system.

Deploy a Docker image

1. Build a Docker image

Execute the following command from the directory containing the Dockerfile to build a Docker image for our Selenium script (the trailing dot specifies the build context):

sudo docker build -t <image-name> .

2. Execute a Docker image

docker run -e USER_INPUT=<search query> -e TOTAL_IMAGES=<number of images> -v <Your working directory path>:/app <image-name>

Example:

docker run -e USER_INPUT=insects -e TOTAL_IMAGES=30 -v /media/project/Test:/app img1

This command passes the search query and the desired number of images into the container as environment variables, which the script reads at startup.

The segment ‘-e USER_INPUT=<search query> -e TOTAL_IMAGES=<number of images>’ sets the USER_INPUT and TOTAL_IMAGES environment variables. Additionally, -v /media/project/Test:/app bind-mounts the local directory /media/project/Test onto /app inside the container, so any files the script writes under /app appear directly in that local directory.

Verifying the output

  • Inspect the directory located at /media/project/Test. It will contain a subdirectory named after the search query (with spaces replaced by underscores). Within this subdirectory, you'll find a urls.txt file containing the URLs of the Google Images results for the query; a quick programmatic check is shown below.
  • [Image: sample contents of the urls.txt file]
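
To confirm the run succeeded without opening the file by hand, here is a minimal check (assuming the example mount path and the 'insects' query used above):

import os

query = 'insects'  # the USER_INPUT value used at runtime (assumed)
folder = f"/media/project/Test/{query.replace(' ', '_')}"
urls_file = os.path.join(folder, 'urls.txt')

# Count the non-empty lines, i.e. the number of image URLs fetched
with open(urls_file) as f:
    urls = [line.strip() for line in f if line.strip()]
print(f"Fetched {len(urls)} image URLs into {urls_file}")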

Conclusion

In summary, our Python Selenium script searches Google Images for the user’s query and extracts image URLs from the page source using BeautifulSoup. The script is deployed with Docker, ensuring smooth, repeatable execution of Selenium operations. This blog walked through the whole process, from setting up the prerequisites to crafting a Dockerfile. Take a look at the Python script to see how Selenium handles web scraping and image collection, and follow these steps to create a reliable, consistent setup for testing with Dockerized Selenium.

Originally published at Dockerized Selenium: Integrating Docker For Python Selenium Scripts on January 25, 2024.


Chatbots Development, Python Programming, Natural Language Processing (NLP), Machine Learning Solutions. https://www.pragnakalp.com