Container

Storage Guide

Configuring Local or Shared Storage for the pdfRest API Toolkit Container.

File Lifecycle Policy

pdfRest is not intended as a file storage solution, but as a document and PDF processing API. There are two main Environment Variables that are configured at container runtime which determine the File Lifecycle Policy of documents in Storage.

REMOVE_ORIGINAL_PROCESSED
integer
Controls whether uploaded and processed documents are deleted from the system. Set to 1 for true or 0 for false. Defaults to 1 (true).
REMOVE_ORIGINAL_PROCESSED_DELAY
integer
Controls how long, in milliseconds, before those documents are deleted from the system. Requires REMOVE_ORIGINAL_PROCESSED to be 1. Defaults to 1800000 (30 minutes).

By combining REMOVE_ORIGINAL_PROCESSED and REMOVE_ORIGINAL_PROCESSED_DELAY, you can configure whether files are deleted from Storage and how long they're retained. The timer for REMOVE_ORIGINAL_PROCESSED_DELAY starts as soon as operations are completed on an individual file.

  • The timer starts on an uploaded file as soon as it's done uploading.
  • If a five page PDF is split into five individual pages, five different timers will start as soon as each file is done processing and added to Storage.
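
These variables can be set like any other container Environment Variable, for example in the environment section of your Docker Compose file (the 60-minute delay shown here is just an illustration):

    environment:
      - REMOVE_ORIGINAL_PROCESSED=1
      - REMOVE_ORIGINAL_PROCESSED_DELAY=3600000  # delete files 60 minutes after processing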

Options for Storage

When deploying the pdfRest API Toolkit Container in your environment, you'll be deploying one or many containers depending on the requirements of your workloads. As noted above, pdfRest is not supported as a file storage solution. Storage should be considered ephemeral in all of the below scenarios.

All input, output, and temporary files are delivered to /opt/datalogics/public on the container.

Local Storage on a Single Container

In the simplest scenario, when deploying a single Toolkit Container instance, it's possible to use the Local Storage of the Docker container to upload, handle requests, and deposit processed files before they're retrieved.

Local Storage on Multiple Containers

It is possible to deploy multiple containers using their individual Local Storage, but some very important capabilities are not supported in this configuration.

Without Shared Storage, the following limits apply when sending requests to multiple containers:

  • Load-balancing of API requests is not supported.
  • Containers will not have access to files and id values of documents processed on other containers.
  • Chaining API requests between different containers is not possible without first downloading, then re-uploading related files.

Shared Storage for Any Number of Containers

Setting up Shared Storage for one or more containers is both straightforward and the most feature-rich configuration. When a number of pdfRest API Toolkit Container deployments share the same Storage, they are able to receive API processing requests for documents that were previously uploaded or processed by other containers. By placing a pool of containers behind a load balancer, the pdfRest service now has High Availability (HA). In most scenarios, the volume used as Shared Storage is also persistent, and files will remain available no matter how many containers are deployed, even zero.

Benefits of Shared Storage:

  • Ability to deploy pdfRest API Toolkit as a High Availability service.
  • Autoscaling capabilities, from zero to many instances of the pdfRest API Toolkit Container.
  • Support for a Load Balancer in front of the pool of pdfRest API instances.
  • On persistent volumes, files are not lost when some or all containers are terminated.
  • Individual containers have access to files and id values generated by other containers.
  • API requests sent to containers (directly or through a Load Balancer) will be able to chain API processing requests via file id values.

Configuring Local Storage

The Deployment Guide shows how you define the storage volumes in the Docker Compose file by including the following:

    volumes:
      - /tmp:/opt/datalogics/public

In the example above, the /tmp directory on the host is mounted to /opt/datalogics/public inside the container.

Using Local Storage for multiple containers is the same process as for a single container.
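
As a sketch, a Compose file for two independent containers, each with its own Local Storage, might look like the following (the service names and <your_image> are placeholders):

    services:
      pdfrest-a:
        image: <your_image>
        volumes:
          - /tmp/pdfrest-a:/opt/datalogics/public
      pdfrest-b:
        image: <your_image>
        volumes:
          - /tmp/pdfrest-b:/opt/datalogics/public

Each container sees only its own bind mount, which is why the limitations listed above apply.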

Configuring Shared Storage

Docker Volume Configuration

If you require Shared Storage between multiple pdfRest containers, set up a shared volume as described in the Docker storage volume documentation and configure the volumes section of the YAML to mount that volume as shown below:

    volumes:
      - <your_volume>:/opt/datalogics/public

Update <your_volume> with the volume path that you intend to mount to /opt/datalogics/public.
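
For example, a Docker named volume shared by two services might be declared like this (the volume name pdfrest-storage and <your_image> are placeholders):

    services:
      pdfrest-1:
        image: <your_image>
        volumes:
          - pdfrest-storage:/opt/datalogics/public
      pdfrest-2:
        image: <your_image>
        volumes:
          - pdfrest-storage:/opt/datalogics/public

    volumes:
      pdfrest-storage:

Note that a named volume using the default local driver is only shared between containers on the same host; to share Storage across multiple hosts, back the volume with network storage such as NFS or Amazon EFS.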

AWS Elastic Container Service

When configuring Shared Storage for an ECS Service, the following information may help.

In the Task Definition, under ContainerDefinitions, you will need to set MountPoints that define the name of the volume that will be mounted to /opt/datalogics/public. In this case, we'll use a generic EFS_VOLUME value:

    MountPoints:
      - SourceVolume: EFS_VOLUME
        ContainerPath: /opt/datalogics/public

We'll define the volume itself next. In this guide, we are using an Amazon EFS File System as our volume. Further down in the Task Definition, we'll define the File System like so:

    Volumes:
      - Name: EFS_VOLUME
        EFSVolumeConfiguration:
          FilesystemId: EFS_ID
          RootDirectory: /
          TransitEncryption: ENABLED

The EFS_ID is the "File System ID" found in the EFS console; it will look like this sample value: fs-0954e0g8331u790ca. The value of EFS_VOLUME does not matter as long as it matches in both places.

The EFS File System needs to allow inbound traffic on port 2049 (NFS). You might not need an access point for traffic from ECS.
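
If you manage the file system's security group with CloudFormation, for example, an ingress rule allowing NFS traffic from the ECS tasks' security group might look like the following (the logical names EfsSecurityGroup and EcsServiceSecurityGroup are placeholders):

    EfsSecurityGroupIngress:
      Type: AWS::EC2::SecurityGroupIngress
      Properties:
        GroupId: !Ref EfsSecurityGroup
        IpProtocol: tcp
        FromPort: 2049
        ToPort: 2049
        SourceSecurityGroupId: !Ref EcsServiceSecurityGroup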

Finally, be sure that the IAM role used for the taskRoleArn has the appropriate EFS permissions, for example in a policy statement like the following (scope Resource to your file system's ARN as appropriate):

    {
        "Effect": "Allow",
        "Action": [
            "elasticfilesystem:ClientMount",
            "elasticfilesystem:ClientWrite",
            "elasticfilesystem:DescribeMountTargets",
            "elasticfilesystem:DescribeFileSystems"
        ],
        "Resource": "*"
    }