
ScrapyRT Port Unreachable in Kubernetes Docker Container Pod

Open · doverradio opened this issue 5 months ago · 1 comment

I'm having trouble reaching a ScrapyRT service on specific ports inside a Kubernetes pod. My setup is a Kubernetes cluster with a pod running a Scrapy application, which uses ScrapyRT to listen for incoming requests on designated ports; a request to a given port is meant to trigger the spider assigned to that port.

Despite setting up a Kubernetes Service whose selector references the Scrapy pod, no requests ever reach the pod. My understanding of Kubernetes networking is that the Service should be created first and the pod afterward, which then allows inter-pod communication and external access through the Service. Is that correct?
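As a sanity check (a minimal sketch, assuming the manifests below are applied in the default namespace), the Service-to-pod wiring can at least be verified like this:

# List the pods the Service's selector should match
kubectl get pods -l app=scrapy-pod -o wide

# If the selector matches, the Endpoints object lists the pod IP behind the Service;
# an empty ENDPOINTS column means the Service is routing to nothing
kubectl get endpoints scrapy-service

# Confirm the Service's ports and selector
kubectl describe service scrapy-service

If the endpoints list is empty, the label/selector wiring is the problem; if it shows the pod IP, the problem is further down the stack.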

Below are the relevant configurations.

scrapy-pod Dockerfile:

# Use Ubuntu as the base image
FROM ubuntu:latest

# Avoid prompts from apt
ENV DEBIAN_FRONTEND=noninteractive

# Update package repository and install Python, pip, and other utilities
RUN apt-get update && \
    apt-get install -y curl software-properties-common iputils-ping net-tools dnsutils vim build-essential python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*


# Install nvm (Node Version Manager) - EXPRESS
ENV NVM_DIR /usr/local/nvm
ENV NODE_VERSION 16.20.1

RUN mkdir -p $NVM_DIR
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash

# Install Node.js and npm - EXPRESS
RUN . "$NVM_DIR/nvm.sh" && nvm install $NODE_VERSION && nvm alias default $NODE_VERSION && nvm use default

# Add Node and npm to path so the commands are available - EXPRESS
ENV NODE_PATH $NVM_DIR/versions/node/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/v$NODE_VERSION/bin:$PATH

# Install Yarn - EXPRESS
RUN npm install --global yarn

# Set the working directory in the container to /usr/src/app
WORKDIR /usr/src/app

# Copy the current directory contents into the container at /usr/src/app
COPY . .

# Install any needed packages specified in requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the start_services.sh script into the container
COPY start_services.sh /start_services.sh

# Make the script executable
RUN chmod +x /start_services.sh


# Install any needed packages specified in package.json using Yarn - EXPRESS
RUN yarn install


# Expose all the necessary ports
EXPOSE 14805 14807 12085 14806 13905 12080 14808 8000


# Define environment variable - EXPRESS
ENV NODE_ENV production

# Run the script when the container starts
CMD ["/start_services.sh"]

start_services.sh:

#!/bin/bash

# Start ScrapyRT instances on different ports
scrapyrt -p 14805 &
scrapyrt -p 14807 &
scrapyrt -p 12085 &
scrapyrt -p 14806 &
scrapyrt -p 13905 &
scrapyrt -p 12080 &
scrapyrt -p 14808 &

# Keep the container running since the ScrapyRT processes are in the background
tail -f /dev/null
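One thing I'm not sure about: if I'm reading the ScrapyRT docs correctly, scrapyrt binds to localhost by default unless an interface is given with -i, which would make these ports unreachable from outside the container even though the listeners start fine. A variant that binds to all interfaces would look like this (sketch):

#!/bin/bash

# Bind each ScrapyRT instance to all interfaces, not just localhost
scrapyrt -i 0.0.0.0 -p 14805 &
scrapyrt -i 0.0.0.0 -p 14807 &
scrapyrt -i 0.0.0.0 -p 12085 &
scrapyrt -i 0.0.0.0 -p 14806 &
scrapyrt -i 0.0.0.0 -p 13905 &
scrapyrt -i 0.0.0.0 -p 12080 &
scrapyrt -i 0.0.0.0 -p 14808 &

# Keep the container alive
tail -f /dev/null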

service YAML file:

apiVersion: v1
kind: Service
metadata:
  name: scrapy-service
spec:
  selector:
    app: scrapy-pod
  ports:
    - name: port-14805
      protocol: TCP
      port: 14805
      targetPort: 14805
    - name: port-14807
      protocol: TCP
      port: 14807
      targetPort: 14807
    - name: port-12085
      protocol: TCP
      port: 12085
      targetPort: 12085
    - name: port-14806
      protocol: TCP
      port: 14806
      targetPort: 14806
    - name: port-13905
      protocol: TCP
      port: 13905
      targetPort: 13905
    - name: port-12080
      protocol: TCP
      port: 12080
      targetPort: 12080
    - name: port-14808
      protocol: TCP
      port: 14808
      targetPort: 14808
    - name: port-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
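Since the Service is type ClusterIP, it is only reachable from inside the cluster, so a throwaway pod is one way to test it end to end (a sketch; myspider is a placeholder and the default namespace is assumed):

# Spin up a one-off curl pod and hit the Service by its DNS name
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -v "http://scrapy-service:14805/crawl.json?spider_name=myspider&url=http://example.com"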

deployment YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
  labels:
    app: scrapy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scrapy-pod
  template:
    metadata:
      labels:
        app: scrapy-pod
    spec:
      containers:
      - name: scrapy-pod
        image: mydockerhub/privaterepository-scrapy:latest
        imagePullPolicy: Always  
        ports:
        - containerPort: 14805
        - containerPort: 14806
        - containerPort: 14807
        - containerPort: 12085
        - containerPort: 13905
        - containerPort: 12080
        - containerPort: 8000
        envFrom:
        - secretRef:
            name: scrapy-env-secret
        - secretRef:
            name: express-env-secret
      imagePullSecrets:
      - name: my-docker-credentials 
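To take the Service out of the equation and talk to the pod directly (sketch):

# Forward a local port straight to the deployment's pod
kubectl port-forward deployment/scrapy-deployment 14805:14805

# Then, from another terminal on the same machine:
curl "http://localhost:14805/crawl.json?spider_name=myspider&url=http://example.com"

If I understand correctly, port-forward can reach a process bound to 127.0.0.1 inside the pod, so this succeeding while the Service path fails would point at the bind address rather than at the Service.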

scrapy-pod's logs in a PowerShell terminal:

> k logs scrapy-deployment-56b9d66858-p59gs -f
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Log opened.
2024-01-09 21:53:27+0000 [-] Site starting on 12080
2024-01-09 21:53:27+0000 [-] Site starting on 14808
2024-01-09 21:53:27+0000 [-] Site starting on 14805
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f4cbdf44d60>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fef9b620a00>
2024-01-09 21:53:27+0000 [-] Site starting on 13905
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 14807
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f0892ff4df0>
2024-01-09 21:53:27+0000 [-] Site starting on 14806
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f00d3b99000>
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fba9e321180>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7f1782514f10>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Site starting on 12085
2024-01-09 21:53:27+0000 [-] Starting factory <twisted.web.server.Site object at 0x7fb2054cd060>
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.
2024-01-09 21:53:27+0000 [-] Running with reactor: AsyncioSelectorReactor.

Issue: Despite these configurations, no requests seem to reach the Scrapy pod. kubectl logs shows the ScrapyRT instances starting successfully on the specified ports (see above), yet requests sent from a separate debug pod running a Python Jupyter notebook succeed against other pods but not against the Scrapy pod.
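Two checks I could still run from the debug pod side (a sketch; the FQDN assumes the default namespace, myspider is a placeholder, and netstat should be available since net-tools is installed in the image):

# From a notebook cell in the debug pod (! runs a shell command)
!curl -v "http://scrapy-service.default.svc.cluster.local:14805/crawl.json?spider_name=myspider&url=http://example.com"

# Inspect which address the listeners are actually bound to inside the scrapy pod:
# 127.0.0.1:14805 would explain the unreachability, 0.0.0.0:14805 would not
kubectl exec deploy/scrapy-deployment -- netstat -tlnp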

Question: How can I successfully connect to the Scrapy pod? What might be preventing the requests from reaching it?

Any insights or suggestions would be greatly appreciated.

doverradio · Jan 12 '24 22:01