gcp-storage-emulator Add functionality that allow to pre-load data into the storage bucket(s)

It would be really useful, if the Docker container could start with pre-loaded data. This can help the usage of the tool for unit and integration testing. This could be done by either attaching a volume with data to pre-load or provide a hook script that is called before the server starts. Similar functionality is implemented in other GCP Storage Emulators and also in other common Docker images like databases (postgres Docker image has the /docker-entrypoint-initdb.d directory where the user can have SQL scripts for database initialization and data import).

Sep 02 '21 14:09 MiltiadisKoutsokeras

https://github.com/oittaa/gcp-storage-emulator#docker

The directory used for the emulated storage is located under /storage in the container. In the following example the host's directory $(pwd)/cloudstorage will be bound to the emulated storage.

Sep 02 '21 22:09 oittaa

This is a directory controlled by the service and is read/write root by default, as the Docker service also runs as root. Additionally this cannot apply in memory backed storage. What I would like to have is a user directory, with user permissions and mounting on the container so at launch it imports all data in there. Ideally the top level directories of the import directory should be used as bucket names. For example the following directory:

import-dir
|_bucket_a
  |_directory_a
  |_directory_b
    |_file_a
    |_file_b
|_bucket_b
  |_directory_c
  |_directory_d
    |_file_e
    |_file_f

Should be loaded on startup and the server should create or use the buckets bucket_a, bucket_b (in memory or disk) and upload the corresponding files into the proper bucket.

Sep 03 '21 08:09 MiltiadisKoutsokeras

Yeah, that sounds like a good idea. I don't have much time at the moment, but pull requests are welcome.

Sep 03 '21 12:09 oittaa

@MiltiadisKoutsokeras just FYI https://github.com/fsouza/fake-gcs-server has the behavior you're after.

For our use-case we actually don't want that behavior and and are trying to move to gcp-storage-emulator instead. But I figured I would drop a note in case you're still in need of that.

Oct 22 '21 19:10 mike-marcacci

I have come up with a solution to the problem. Here it goes.

First I use Docker Compose to launch the container with these directives:

google_storage:
        image: oittaa/gcp-storage-emulator
        restart: unless-stopped
        ports:
            # Exposed in port 9023 of localhost
            - "127.0.0.1:9023:9023/tcp"
        environment:
            ####################################################################
            # Application environment variables
            PROJECT_ID: ${PROJECT_ID:-localtesting}
        entrypoint: /entrypoint.sh
        command: ["gcp-storage-emulator", "start",
            "--host=google_storage", "--port=9023", "--in-memory",
            "--default-bucket=${BUCKET_NAME:-localtesting_bucket}" ]
        volumes:
            - ./tests/storage/entrypoint.sh:/entrypoint.sh:ro
            - ./tests/storage/docker_entrypoint_init.py:/docker_entrypoint_init.py:ro
            - ./tests/storage/buckets:/docker-entrypoint-init-storage:ro

As you can see I pass the desired project name and bucket name via Env Vars, PROJECT_ID and BUCKET_NAME. I override the entrypoint of the container with my own Bash script/Python script combination, entrypoint.sh and docker_entrypoint_init.py. Here are their contents:

entrypoint.sh

#!/usr/bin/env bash

# Exit in any error
set -e

[ "${PROJECT_ID}" = "" ] && { echo "PROJECT_ID Environment Variable is not Set!"; exit 1; }

# Install Python requirements
pip install google-cloud-storage==1.31.2

# Execute command line arguments in background and save process ID
"${@}" & PROCESSID=$!

# Wait process to start
while ! kill -0 "${PROCESSID}" >/dev/null 2>&1
do
    echo "Waiting for process to start..."
    sleep 1
done
echo "Process started, ID = ${PROCESSID}"
sleep 2

# Cloud Emulators
export STORAGE_EMULATOR_HOST=http://google_storage:9023

# Import data to bucket
echo "Importing data..."
python3 /docker_entrypoint_init.py
echo "DONE"

# Wait process to exit
wait "${PROCESSID}"

docker_entrypoint_init.py

"""Initialize Google Storage data
"""

import logging
from os import scandir, environ
import sys
from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

def upload_contents(client, directory, bucket_name=None):
    """Upload recursively contents of specified directory.

    Args:
        client (google.cloud.storage.Client): Google Storage Client.
        directory (str): upload directory path.
        bucket_name (str, optional): Bucket name to use for upload. Defaults to
        None.
    """
    for entry in scandir(directory):
        print(entry.path)
        if entry.is_dir():
            if bucket_name is not None:
                # This is a normal directory inside a bucket
                upload_contents(client, directory + '/' +
                                entry.name, bucket_name)
            else:
                # This is a bucket directory
                upload_contents(client, directory + '/' +
                                entry.name, entry.name)
        elif entry.is_file():
            if bucket_name is not None:
                tokens = entry.path.split(bucket_name + '/')
                bucket_obj = client.bucket(bucket_name)
                if len(tokens) > 1:
                    gs_path = tokens[1]
                    blob_obj = bucket_obj.blob(gs_path)
                    blob_obj.upload_from_filename(entry.path)

PROJECT_ID = environ.get('PROJECT_ID')
if PROJECT_ID is None:
    logger.error('Missing required Environment Variables! Please set \
PROJECT_ID')
    sys.exit(1)

storage_client = storage.Client(credentials=AnonymousCredentials(),
                                project=PROJECT_ID)

# Scan import data directory
upload_contents(storage_client, '/docker-entrypoint-init-storage')

logger.info('Successfully imported bucket data!')
logger.info('List:')
for bucket in storage_client.list_buckets():
    print(f'Bucket: {bucket}')
    for blob in bucket.list_blobs():
        print(f'|_Blob: {blob}')

# All OK
sys.exit(0)

I hope this is helpful.

Feb 04 '22 12:02 MiltiadisKoutsokeras

gcp-storage-emulator gcp-storage-emulator copied to clipboard

Add functionality that allow to pre-load data into the storage bucket(s)

gcp-storage-emulator
gcp-storage-emulator copied to clipboard