Storage Guide
File Lifecycle Policy
pdfRest is not intended as a file storage solution, but as a document and PDF processing API. Two main Environment Variables, configured at container runtime, determine the File Lifecycle Policy of documents in Storage:
- REMOVE_ORIGINAL_PROCESSED determines whether files are removed from Storage after processing. 1 for true or 2 for false. Defaults to 1 for true.
- REMOVE_ORIGINAL_PROCESSED_DELAY sets how long, in milliseconds, files are retained before removal. Requires REMOVE_ORIGINAL_PROCESSED to be 1. Defaults to 1800000 (30 minutes).

By combining REMOVE_ORIGINAL_PROCESSED and REMOVE_ORIGINAL_PROCESSED_DELAY you can configure whether files are deleted from Storage, and how long they're retained. The timer for REMOVE_ORIGINAL_PROCESSED_DELAY starts as soon as operations are completed on the individual file:
- The timer starts on an uploaded file as soon as it's done uploading.
- If a five page PDF is split into five individual pages, five different timers will start as soon as each file is done processing and added to Storage.
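For example, in a Docker Compose deployment these variables can be set under the service's `environment` key. This is a sketch: the service name `pdfrest` is a placeholder, and the ten-minute delay is illustrative rather than a recommended value.

```yaml
services:
  pdfrest:
    environment:
      # Remove files from Storage after processing (1 = true)
      - REMOVE_ORIGINAL_PROCESSED=1
      # Retain files for 10 minutes (600000 ms) instead of the 30-minute default
      - REMOVE_ORIGINAL_PROCESSED_DELAY=600000
```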
Options for Storage
When deploying the pdfRest API Toolkit Container in your environment, you'll be deploying one or many containers depending on the requirements of your workloads. As noted above, pdfRest is not supported as a file storage solution. Storage should be considered ephemeral in all of the below scenarios.
All of the options below mount Storage to /opt/datalogics/public on the container.

Local Storage on a Single Container
In the simplest scenario, when deploying a single Toolkit Container instance, it's possible to use the Local Storage of the Docker container to upload, handle requests, and deposit processed files before they're retrieved.
Local Storage on Multiple Containers
It is possible to deploy multiple containers using their individual Local Storage, but some very important capabilities are not supported in this configuration.
Without Shared Storage, the following limits apply when sending requests to multiple containers:
- Load-balancing of API requests is not supported.
- Containers will not have access to files and id values of documents processed on other containers.
- Chaining API requests between different containers is not possible without first downloading, then re-uploading related files.
Shared Storage for Any Number of Containers
Setting up Shared Storage for one to many containers is both easy and the most feature-rich configuration. When a number of pdfRest API Toolkit Container deployments share the same Storage, they are able to receive API processing requests for documents that were previously uploaded or processed by other containers. By placing a pool of containers behind a load-balancer, the pdfRest service gains High Availability (HA). In most scenarios, the volume used as Shared Storage is also persistent, so files remain available no matter how many containers are deployed, even zero.
Benefits of Shared Storage:
- Ability to deploy pdfRest API Toolkit as a High Availability service.
- Autoscaling capabilities, from zero to many instances of the pdfRest API Toolkit Container.
- Support for a Load Balancer in front of the pool of pdfRest API instances.
- On persistent volumes, files are not lost when some or all containers are terminated.
- Individual containers have access to files and id values generated by other containers.
- API requests sent to containers (directly or through a Load Balancer) will be able to chain API processing requests via file id values.
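As an illustration of id-based chaining, the sketch below uploads a file once and then references it by id in a follow-up request. The hostname, endpoint paths, form field names, and response shape shown here are placeholders, not confirmed pdfRest API details; consult the pdfRest API documentation for the actual request formats.

```shell
# Upload once; with Shared Storage, any container behind the load balancer
# can serve the follow-up request. Endpoint and field names are illustrative.
UPLOAD_ID=$(curl -s -X POST "https://pdfrest.example.com/upload" \
  -F "file=@document.pdf" | jq -r '.files[0].id')

# Chain a processing request by id instead of re-uploading the file
curl -s -X POST "https://pdfrest.example.com/compressed-pdf" \
  -F "id=${UPLOAD_ID}"
```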
Configuring Local Storage
The Deployment Guide shows how you define the storage volumes in the Docker Compose file by including the following:
volumes:
- /tmp:/opt/datalogics/public
In the example above, the /tmp directory on the Docker host is mounted to /opt/datalogics/public inside of the container.
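Putting the pieces together, a minimal Compose file for a single container with Local Storage might look like the following sketch. The image name and port mapping are placeholders; substitute the values from your Deployment Guide.

```yaml
services:
  pdfrest:
    image: <pdfrest_toolkit_image>   # placeholder; use the image from the Deployment Guide
    ports:
      - "8080:8080"                  # illustrative port mapping
    volumes:
      - /tmp:/opt/datalogics/public  # host /tmp mounted as container Storage
```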
Configuring Shared Storage
Docker Volume Configuration
If you require Shared Storage between multiple pdfRest containers, set up a shared volume as described in the Docker storage volume documentation and configure the volumes section of the YAML to mount that volume as shown below:
volumes:
- <your_volume>:/opt/datalogics/public
Replace <your_volume> with the name or path of the volume that you intend to mount to /opt/datalogics/public.
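As one example, Docker's built-in local driver can mount an NFS export as a named volume, which multiple containers can then share. The server address and export path below are placeholders for your own NFS server; this is a sketch, not the only supported driver.

```shell
# Create a named volume backed by an NFS export (placeholder address/path)
docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=nfs.example.com,rw \
  --opt device=:/exports/pdfrest \
  pdfrest_shared
```

The resulting volume would then be referenced in the YAML as `pdfrest_shared:/opt/datalogics/public`.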
AWS Elastic Container Service
When configuring Shared Storage for an ECS Service, the following information may help.
In the Task Definition, under ContainerDefinitions, you will need to set MountPoints that define the name of the volume that will be mounted to /opt/datalogics/public. In this case, we'll use a generic EFS_VOLUME value:
MountPoints:
- SourceVolume: EFS_VOLUME
ContainerPath: /opt/datalogics/public
We'll define the volume itself next. In this guide, we are using an Amazon EFS File System as our volume. Further down in the Task Definition, we'll define the File System like so:
Volumes:
- Name: EFS_VOLUME
EFSVolumeConfiguration:
FilesystemId: EFS_ID
RootDirectory: /
TransitEncryption: ENABLED
The EFS_ID is the "File System ID" found in the EFS console. It will look like this (sample) value: fileSystemId: fs-0954e0g8331u790ca. The value of EFS_VOLUME does not matter as long as it matches in both places.
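If you have the AWS CLI configured, you can also list File System IDs without opening the console. The query expression below just narrows the output to the ID and name fields.

```shell
# List EFS file system IDs and names (requires AWS CLI credentials)
aws efs describe-file-systems \
  --query 'FileSystems[].{Id:FileSystemId,Name:Name}' \
  --output table
```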
The EFS File System needs to allow traffic on port 2049 (NFS); you might not need an access point for traffic from ECS.
Finally, be sure that the IAM role used for the taskRoleArn has the appropriate EFS permissions:
"Action": [
"elasticfilesystem:ClientMount",
"elasticfilesystem:ClientWrite",
"elasticfilesystem:DescribeMountTargets",
"elasticfilesystem:DescribeFileSystems"
],
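For context, those actions sit inside a policy statement like the following sketch. The Resource ARN uses placeholder values for region, account, and file system ID; scoping it to your specific File System ARN is tighter than allowing "*".

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:ClientMount",
        "elasticfilesystem:ClientWrite",
        "elasticfilesystem:DescribeMountTargets",
        "elasticfilesystem:DescribeFileSystems"
      ],
      "Resource": "arn:aws:elasticfilesystem:<region>:<account_id>:file-system/<EFS_ID>"
    }
  ]
}
```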