Creating a custom workflow on Basepair
Overview
Basepair allows you to import and run a custom bioinformatics workflow in a few simple steps. We recommend packaging the individual steps of the workflow into Docker containers for portability, version control, and reproducibility. Then, using YAML text files, the individual steps (modules) can be defined and stitched together to create the workflow. The YAML format is supported by all major programming languages, is easy to read and edit, and lets you keep your modules and workflows in version control. After creating the workflow, it needs to be tested to ensure it runs properly and produces the expected results. *Alternatively, you can provide your custom pipeline to Basepair as-is, and we can Dockerize and build the pipeline for you. Feel free to check out our pages on creating and storing Docker images for more information.
Define the modules
A "module" refers to a single tool or a group of tools that run as a single unit. For example, a module may be:
- a single command using Picard MarkDuplicates to remove duplicate reads from a BAM file, or
- a call to BWA to align the data as well as a SAMtools call to sort and index the alignments
In the second example above, the BWA and SAMtools calls are combined in a single module because many tools require sorted and indexed BAM files for downstream analysis. This way, the module performs the complete logical task of alignment, even though it calls two different software tools. Importantly, modules are often reused across multiple workflows, so command options and input files should not be hard-coded, but instead exposed as parameters in the docker run call. The module YAML file defines 3 primary pieces of information:
- Path to executables and command structure
- Inputs
- Outputs
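For illustration, a module definition might look like the sketch below. The field names and the {braced} placeholders are hypothetical, not Basepair's actual schema; the point is that the executable path, the command structure, and the inputs and outputs are all declared, with options exposed as parameters rather than hard-coded:
# Hypothetical module definition -- field names are illustrative only
name: bwa-align-sort
executable: /usr/local/bin/bwa        # path to the main executable
# Command structure; {braced} values are parameters filled in at run time
command: >-
  mem -t {num_threads} {genome_index} {input_fastq}
  | samtools sort -o {output_bam} -
  && samtools index {output_bam}
inputs:
  - id: input_fastq                   # reads to align
  - id: genome_index                  # BWA index for the chosen genome
  - id: num_threads                   # exposed as a parameter, not hard-coded
outputs:
  - id: output_bam                    # sorted, indexed BAM for downstream modules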
Define the workflow
A Basepair workflow is a collection of modules that run together. A workflow may have a single module, or 20+ modules performing QC, alignment, figure generation, and more. The workflow is defined as a directed acyclic graph (DAG), where each module is a node in the graph. This structure allows a module to have multiple parent modules and provides further flexibility in designing a workflow. The workflow YAML file defines 3 primary pieces of information:
- A collection of nodes, each representing a module with a unique node id. If a module is required multiple times in a workflow, each instance is assigned a different node id. Default parameters can also be set for each module.
- A collection of edges, each representing a parent-child relationship between modules. Starting with a root module, all other modules are connected in a way that defines how the workflow will run.
- Mappings: common global metadata, such as the genome assembly being used, is passed directly on to the modules.
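Continuing the hypothetical schema from the module sketch above (again, field names are illustrative, not Basepair's actual format), a workflow definition might look like:
# Hypothetical workflow definition -- illustrative only
name: basic-dnaseq
nodes:
  - node_id: trim_1                   # unique node id for this module instance
    module: trim-reads
  - node_id: align_1
    module: bwa-align-sort
    params:
      num_threads: 8                  # default parameter for this node
  - node_id: markdup_1
    module: picard-markduplicates
edges:
  - parent: trim_1                    # root module
    child: align_1
  - parent: align_1
    child: markdup_1
mappings:
  genome: hg38                        # global metadata passed on to the modules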
Testing
Basepair provides testing frameworks for two purposes:
- Technical testing: testing the structure of modules and workflows to check that parameters are passed on correctly, commands are formed as expected, data flows through the graph as expected, etc.
- Scientific testing: testing the output files of a workflow against validated results to ensure that the data is being correctly analyzed. This testing is important to establish the scientific accuracy of the workflows.
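Basepair supplies the framework itself, but as a minimal illustration of the scientific-testing idea (the file names below are hypothetical), a run's outputs can be checksummed and compared against a previously validated reference run:
# Compare this run's outputs to checksums recorded from a validated run
md5sum output/aligned.sorted.bam output/counts.txt > run_checksums.md5
diff run_checksums.md5 validated_checksums.md5 \
  && echo "Outputs match the validated results"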
Dockerizing your pipeline
Docker containers provide an isolated environment that makes software tools easy to distribute and run. We recommend placing each major step of the pipeline into its own Docker container, to be run in a step-wise fashion. This page provides instructions for containerizing your pipeline with Docker (information on storing Docker images remotely can be found here).
- Step 1: create the Dockerfile
In a folder containing any necessary scripts or files, create and open a file called "Dockerfile":
nano Dockerfile
Specify a base image to build upon (e.g. a plain ubuntu image) by adding to the file:
FROM ubuntu
Install any required dependencies (e.g. make, gcc, and git):
RUN apt-get update -qq \
&& apt-get install -y make gcc \
&& apt-get install -y git
(use the -y flag to ensure installation continues through any prompts)
If installing a tool from github:
RUN git clone https://github.com/alexdobin/STAR.git
For R scripts requiring external packages, install these packages with the commands below (this assumes R is already available in the image, e.g. from an R base image or via apt-get install -y r-base):
RUN R -e "install.packages('BiocManager')"
RUN R -e "BiocManager::install('DESeq2')"
Specify the program to run when the container is started:
ENTRYPOINT ["/STAR-2.7.10b/bin/Linux_x86_64/STAR"]
Exit and save the Dockerfile
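Putting the steps together, a minimal Dockerfile for STAR might look like the sketch below. Note that cloning the repository places STAR under /STAR (the repo ships a precompiled Linux binary in bin/Linux_x86_64), whereas the ENTRYPOINT example above assumes a versioned release directory such as /STAR-2.7.10b; adjust the path to match how you obtain STAR:
# Minimal sketch: containerize STAR by cloning its GitHub repository
FROM ubuntu
RUN apt-get update -qq \
    && apt-get install -y make gcc git
# The cloned repository includes a precompiled Linux binary under bin/
RUN git clone https://github.com/alexdobin/STAR.git
# Arguments passed to docker run are appended to this command
ENTRYPOINT ["/STAR/bin/Linux_x86_64/STAR"]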
- Step 2: build the docker image
Specify the Dockerfile and a tag (-t) by naming the image and assigning a version number (e.g. ":0.0.1"):
docker build -f Dockerfile -t image_name:0.0.1 .
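After the build finishes, you can confirm the image exists and exercise the entrypoint. The second command assumes the STAR ENTRYPOINT from Step 1; arguments after the image name are appended to the ENTRYPOINT command:
docker image ls                          # image_name:0.0.1 should be listed
docker run --rm image_name:0.0.1 --version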
The image can then be re-tagged with your Docker Hub username in order to push it to your Docker Hub repo:
docker tag image_name:0.0.1 dockerhub_user_name/image_name:0.0.1
- Step 3: push to Docker Hub
docker push dockerhub_user_name/image_name:0.0.1
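Once the push completes, anyone can pull the image with:
docker pull dockerhub_user_name/image_name:0.0.1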
Storing your docker images
There are many options available for storing docker images.
Storage in a public repository
For pipelines that are publicly available, Docker Hub is the simplest option. Docker Hub allows unlimited public repositories with a free account, and anyone can pull your images.
An existing image can be copied and renamed using your Docker Hub user ID with:
docker tag image_name:0.0.1 userID/image_name:0.0.1
And then pushed to your Docker Hub repository with:
docker push userID/image_name:0.0.1
Storage in a private repository
To store your Docker images privately, we recommend Amazon's Elastic Container Registry (ECR). With ECR, you have two options:
- Option 1: store the image under your own AWS account and set up the proper IAM policy to allow Basepair to pull images from your repository (a sketch of such a policy is shown after this list).
- Option 2: store the image in Basepair's ECR. We will set up the proper IAM policy allowing you to push images to a private repository in our ECR. To do this, first contact a Basepair team member in order to share the necessary AWS credentials. Once Basepair has created the repository, we will share the information needed for you to push images.
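For Option 1, an ECR repository policy along the following lines could grant pull access. This is a sketch only: the bracketed account ID is a placeholder you would get from Basepair, and the exact set of actions should be confirmed with them.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBasepairPull",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::[basepair_aws_account_id]:root" },
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
Saved as basepair-pull-policy.json, it can be attached to the repository with:
aws ecr set-repository-policy \
    --repository-name [repository_name] \
    --policy-text file://basepair-pull-policy.json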
*In the example code below, you will need to replace the values in brackets [] with your own:
- Step 1: build the image using a tag that contains the repositoryUri we shared with you, the image name you want to use, and a version (e.g. "v0.0.1")
cd /path/to/your/project # this is where the Dockerfile is present
docker build -f Dockerfile -t [repositoryUri]:[image_name]-[v0.0.1] .
Check that the image was created by listing the existing images on your machine:
docker image ls
- Step 2: get a temporary Docker credential from ECR for use with docker login
aws ecr get-login-password \
    --region [region] \
    | docker login --username AWS \
    --password-stdin [aws_id].dkr.ecr.[region].amazonaws.com
- Step 3: push the image to Basepair's ECR using the repositoryUri we shared with you
docker push [repositoryUri]:[image_name]-[v0.0.1]
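To confirm the push succeeded, pull the image back using the same tag:
docker pull [repositoryUri]:[image_name]-[v0.0.1]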