Artifacts, Caching, and Local Storage for HPC Projects
It is not uncommon for our codes to carry particularly large storage requirements, not just for the application itself but also for its dependencies. This evolving guide provides tips and best practices for storing desired job results in a repeatable manner. We will give a brief overview of GitLab artifacts, caching, and the differences between them, but we highly recommend reviewing the upstream documentation if you have yet to use either of these tools.
Job Artifacts
Artifacts are a well-supported mechanism by which a list of files/directories can be declared in a job and uploaded to the GitLab server upon its completion.
my-job:
  artifacts:
    paths:
      - binaries/
    expire_in: 5 days
From this point, upon completion of the CI job, the entire binaries/ directory will be available not just to subsequent stages of the pipeline but also for manual download from the job results page.
A traditional workflow might involve generating the binaries during a build stage, capturing them via artifacts, and then having them available to all jobs in a subsequent test stage.
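A minimal sketch of that pattern follows; the job names, paths, and make targets here are placeholders for illustration, not part of any real project:

build-app:
  stage: build
  script:
    - make all
  artifacts:
    paths:
      - binaries/
    expire_in: 5 days

test-app:
  stage: test
  script:
    # binaries/ captured by build-app is restored automatically before this job runs.
    - ./binaries/run-tests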
Note
Maximum artifact size is configured by your GitLab server administrator. Please check your site's deployment documentation for specific details.
Your job log will always clearly reflect the presence of these artifacts:
Downloading artifacts
Downloading artifacts for upload (579951664)
Runtime platform    arch=amd64 os=linux pid=7851 revision=5349ce9b version=13.0.0
Downloading artifacts from coordinator... ok    id=579951664 responseStatus=200 OK
There have been no changes to the deployed runners that would affect the default behavior of the artifacts functionality.
Artifact Dependencies and Needs
As highlighted in the dependencies documentation, artifacts are, by default, made available to all jobs in subsequent stages. However, this behavior can be modified using dependencies: [...].
source-job:
  artifacts:
    ...

download-source-only:
  dependencies: [source-job]
  ...

download-none:
  dependencies: []
  ...
Properly leveraging dependencies can improve job performance by avoiding the upload/download of large files, and, more importantly, it avoids unexpected test conditions when multiple artifacts exist for the same files.
The needs functionality that enables the creation of directed acyclic graphs works similarly. All artifacts are managed automatically based upon the jobs specified in needs: [...].
If desired, you can disable artifact downloads using a different mechanism.
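For example, here is a sketch of a job that starts as soon as the job it needs finishes while skipping the artifact download entirely; the job names are illustrative, and needs:artifacts assumes a GitLab release that supports it:

build-docs:
  stage: build
  script:
    - make docs
  artifacts:
    paths:
      - public/

lint-only:
  stage: test
  needs:
    # Start as soon as build-docs completes, but do not download its artifacts.
    - job: build-docs
      artifacts: false
  script:
    - make lint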
Problems Faced
Undoubtedly the biggest issue teams will face when using artifacts is the size requirement. Artifacts can be limited to an acceptable size by the server administrators, and there is no shortage of projects that will require multiple gigabytes of space for their completely built application/library.
To compound the problem, GitLab does not apply additional compression to artifacts before uploading them to the server; it relies solely on ZIP. This is done for a number of technical reasons discussed in an upstream issue.
As a team you will need to work around these limitations, and there is sadly no easy solution. We recommend reviewing the caching and local storage approaches covered in the remainder of this guide.
Job Caching
GitLab offers a caching mechanism as part of the CI pipeline. When caching is defined for a CI job, all identified files/directories will be compressed and moved to a runner-managed storage location. Everything cached can then be automatically retrieved during subsequent CI jobs. Unlike artifacts, caches can exist across pipelines but still remain exclusive to a project.
GitLab has a series of well-defined use cases for managing caches that we recommend reviewing.
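As an illustration, a minimal per-branch cache might look like the following; the key, paths, and helper script are assumptions made for the sake of the example:

build-deps:
  stage: build
  cache:
    # One cache per branch; adjust the key to control how widely the cache is shared.
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - deps/
  script:
    # Hypothetical script that only rebuilds dependencies missing from deps/.
    - ./scripts/build_deps.sh deps/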
Note
The exact file path for each CI job is not guaranteed to be identical, even when attempting to use the same runner. Aspects such as the runner short token, administrator configuration, and, most likely, other concurrent jobs all play a part in the CI_PROJECT_DIR your job will be in. Non-relocatable binaries or fully qualified paths (e.g. ctest) may lead to unexpected behaviors.
Clearing a Cache
GitLab offers documentation for clearing the cache that is accurate, but with one major caveat to keep in mind: the Clean Runner Caches button only increments the cache index; it does not trigger the removal of the now-invalidated cache.
Setuid Enhancements
With the setuid functionality enabled, the GitLab caching mechanism is handled entirely in user space. As such, caches are stored on the local file system, owned by that user and inaccessible to others. Though a cache can exist across pipelines, with setuid enabled that is only true if the pipeline's triggering user remains consistent.
Warning
Depending on facility rules, caches created as part of CI can count against your quota. Please keep this in mind when defining your cache's key to maximize reusability.
Cache Dependencies and Policy
Similar to the Artifact Dependencies and Needs section, you can share caches by referring to their declared keys, or ignore all previous caches with an empty hash (cache: {}). It is also possible to maintain and inherit a global configuration.
You can also define policies to control how the job interacts with the cache. By default the policy is pull-push, but you can set whatever best suits your job.
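As a sketch (job names, keys, and make targets here are placeholders), two jobs can share a cache by key while one of them only pulls it:

build-app:
  stage: build
  cache:
    key: ccache-${CI_COMMIT_REF_SLUG}
    paths:
      - .ccache/
    # pull-push is the default: fetch the cache before the job, upload changes after.
    policy: pull-push
  script:
    - make build

quick-check:
  stage: test
  cache:
    key: ccache-${CI_COMMIT_REF_SLUG}
    paths:
      - .ccache/
    # pull only: reuse the cache but never upload changes back to it.
    policy: pull
  script:
    - make check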
Local Storage
We realize teams may have project CI requirements that do not fit the traditional GitLab mechanisms for managing results. These requirements could include:
Existing infrastructure you maintain locally that can be leveraged as part of a CI pipeline.
Complicated software dependencies that are used across instances of your project and need to be built in advance of a job. These often require hours of compilation and iterate more slowly than the project itself.
Binaries or results that are non-relocatable.
Input or output aspects of the CI process that must be maintained locally to adhere to project policy.
Job artifacts required from a previous stage are multiple gigabytes in size, surpassing existing admin-imposed limitations.
In these sorts of cases, using local storage provided to your project by the facility makes the most sense. That is part of the reason the runner operates as a local user: to allow for seamless access to local storage found on the test resources.
As an example, we use Spack to build Go before executing any tests on our own code. Since this can be a relatively large build and remains consistent across different iterations of our project, storing it in a central location makes sense.
spack-go:
  stage: prepare
  before_script:
    # A script we use to Git the correct release of the Spack repo.
    - source test/scripts/deploy_spack.bash
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/environment
  script:
    - spack install -j 2 go@${GOVERSION}

build-tool:
  stage: build
  before_script:
    - source ${SPACK_ROOT}/spack/share/spack/setup-env.sh
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/alcf
    - spack load go@${GOVERSION}
  script:
    - make build
This simple example looks only at Spack but you could accomplish this with any sort of package/dependency management tool or even with your own scripts.
Note
Spack offers a mechanism to generate GitLab pipelines that can be tied directly into the creation of a build cache. This workflow can be more involved but can dramatically improve your project's flexibility in managing complex software dependencies.
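A rough sketch of that workflow follows, assuming your Spack environment already carries a ci configuration and that your GitLab release supports child pipelines; the job names and output file are illustrative only:

generate-pipeline:
  stage: prepare
  script:
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/environment
    # Generates a child pipeline definition from the environment's ci configuration.
    - spack ci generate --output-file generated-pipeline.yml
  artifacts:
    paths:
      - generated-pipeline.yml

build-dependencies:
  stage: build
  trigger:
    include:
      - artifact: generated-pipeline.yml
        job: generate-pipeline
    strategy: depend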
Project/Scratch Space
With setuid enhancements and the potential use of service accounts, we want to ensure you have the same access to project/scratch space you are accustomed to during traditional runs. There should be no barriers preventing you from using these spaces however you and your team see fit.
Note
Relying on the after_script as a mechanism to automate cleanup of your local storage is recommended, as it will run regardless of whether the job has failed. However, at this time it will not run if the job was canceled or timed out. There is an issue open to correct this.
Access Between Jobs
Manually maintaining access to files on network storage can be made easier by leveraging predefined variables that are provided to all CI jobs. Something like the CI_PIPELINE_ID will remain consistent throughout an entire pipeline.
job:
  script:
    - mkdir -p /my/project/testing-$CI_PIPELINE_ID
    ...

test:
  script:
    ...
  after_script:
    - rm -rf /my/project/testing-$CI_PIPELINE_ID
Modules
An easy way to improve job times and potentially decrease artifact size is to rely on the available local software environment. For instance, modules:
before_script:
  - ml openmpi hdf5 trilinos petsc

job-one:
  script:
    ...
The obvious tradeoff with modules is the decrease in portability of your jobs. Usually we find this is a worthwhile trade, and in the case of HPC jobs scheduled using the batch executor you will already have to account for some degree of facility-specific environment.
Remote Storage
Remote storage, in our case anything that is not locally managed, is technically a viable target. Remember, anything that can be accomplished with a script can be realized in a CI job. However, you are responsible for adhering to all site policies before transferring data off a system. Just because it is made easier with CI doesn't mean it is acceptable use.
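For instance, pushing results off-system is just another script step; the destination host and paths below are entirely hypothetical, and any transfer like this must be cleared against your site's data policies first:

export-results:
  stage: deploy
  script:
    - make report
    # Hypothetical archive host; confirm this transfer is permitted before enabling it.
    - rsync -av results/ archive-user@archive.example.org:/projects/my-project/${CI_PIPELINE_ID}/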