Artifacts, Caching, and Local Storage for HPC Projects
It is not uncommon for our codes to carry particularly large storage requirements, not just for the application itself but also for its dependencies. This evolving guide provides tips and best practices for storing desired job results in a repeatable manner. We will give a brief overview of GitLab artifacts, caching, and the differences between them, but we highly recommend reviewing the upstream documentation if you have yet to use either of these tools.
Job Artifacts
Artifacts are a well-supported mechanism by which a list of files/directories can be declared in a job and uploaded to the GitLab server upon its completion.
my-job:
  artifacts:
    paths:
      - binaries/
    expire_in: 5 days
From this point, upon completion of the CI job, the entire binaries/ directory will be available not just to subsequent stages of the pipeline but also for manual download from the job results page.
A traditional workflow might involve generating the binaries during a build stage, capturing them via artifacts, and then having them available to all jobs in a subsequent test stage.
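A minimal sketch of that pattern follows; the job names, paths, and make targets here are placeholders for illustration, not part of any real project:

build-app:
  stage: build
  script:
    - make all
  artifacts:
    paths:
      - binaries/
    expire_in: 5 days

test-app:
  stage: test
  script:
    # binaries/ captured by build-app is restored automatically before this job runs.
    - ./binaries/run-tests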
Note
Maximum artifact size is configured by your GitLab server administrator. Please check your site's deployment documentation for specific details.
Your job log will always clearly reflect the presence of these artifacts:
Downloading artifacts
Downloading artifacts for upload (579951664)
Runtime platform    arch=amd64 os=linux pid=7851 revision=5349ce9b version=13.0.0
Downloading artifacts from coordinator... ok    id=579951664 responseStatus=200 OK
There have been no changes to the deployed runners that would affect the default behavior of the artifacts functionality.
Artifact Dependencies and Needs
As highlighted in the dependencies documentation, artifacts are, by default, made available to all jobs in subsequent stages. However, this behavior can be modified using dependencies: [...].
source-job:
  artifacts:
    ...

download-source-only:
  dependencies: [source-job]
  ...

download-none:
  dependencies: []
  ...
Properly leveraging dependencies can improve job performance by avoiding the upload/download of large files, and, more importantly, it avoids unexpected test conditions when multiple artifacts exist for the same files.
The needs functionality that enables the creation of directed acyclic graphs works similarly. All artifacts are managed automatically based upon the jobs specified in needs: [...].
If desired, you can disable artifact downloads using a different mechanism.
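For example, here is a sketch of a job that starts as soon as the job it needs finishes while skipping the artifact download entirely; the job names are illustrative, and needs:artifacts assumes a GitLab release that supports it:

build-docs:
  stage: build
  script:
    - make docs
  artifacts:
    paths:
      - public/

lint-only:
  stage: test
  needs:
    # Start as soon as build-docs completes, but do not download its artifacts.
    - job: build-docs
      artifacts: false
  script:
    - make lint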
Problems Faced
Undoubtedly the biggest issue teams will face when using artifacts is the size requirement. Artifacts can be limited to an acceptable size by the server administrators, and there is no shortage of projects that will require multiple gigabytes of space for their completely built application/library.
To compound the problem, GitLab does not apply additional compression to artifacts before uploading them to the server; it relies solely on ZIP. This is done for a number of technical reasons discussed in an upstream issue.
As a team you will need to work around these limitations, and there is sadly no easy solution. We recommend reviewing the caching and local storage approaches covered in the remainder of this guide.
Job Caching
GitLab offers a caching mechanism as part of the CI pipeline. When caching is defined for a CI job, all identified files/directories will be compressed and moved to a runner-managed storage location. Everything cached can then be automatically retrieved during subsequent CI jobs. Unlike artifacts, caches can exist across pipelines but still remain exclusive to a project.
GitLab has a series of well-defined use cases for managing caches that we recommend reviewing.
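As an illustration, a minimal per-branch cache might look like the following; the key, paths, and helper script are assumptions made for the sake of the example:

build-deps:
  stage: build
  cache:
    # One cache per branch; adjust the key to control how widely the cache is shared.
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - deps/
  script:
    # Hypothetical script that only rebuilds dependencies missing from deps/.
    - ./scripts/build_deps.sh deps/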
Note
The exact file path for each CI job is not guaranteed to be identical, even when attempting to use the same runner. Aspects such as the runner short token, administrator configuration, and, most likely, other concurrent jobs all play a part in the CI_PROJECT_DIR your job will be in. Non-relocatable binaries or fully qualified paths (e.g. ctest) may lead to unexpected behaviors.
Clearing a Cache
GitLab offers documentation for clearing the cache that is accurate, but with one major caveat to keep in mind: the Clean Runner Caches button only increments the cache index; it does not trigger the removal of the now-invalidated cache.
Setuid Enhancements
With the setuid functionality enabled, the GitLab caching mechanism is handled entirely in user space. As such, caches are stored on the local file system, owned by that user and inaccessible to others. Though a cache can exist across pipelines, with setuid enabled that is only true if the pipeline's triggering user remains consistent.
Warning
Depending on facility rules, caches created as part of CI can count against your quota. Please keep this in mind when defining your cache's key to maximize reusability.
Cache Dependencies and Policy
Similar to the Artifact Dependencies and Needs section, you can share caches by referring to their declared keys, or ignore all previous caches with an empty hash (cache: {}). It is also possible to maintain and inherit a global configuration.
You can also define policies to control how the job interacts with the cache. By default the policy is pull-push, but you can set whatever best suits your job.
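As a sketch (job names, keys, and make targets here are placeholders), two jobs can share a cache by key while one of them only pulls it:

build-app:
  stage: build
  cache:
    key: ccache-${CI_COMMIT_REF_SLUG}
    paths:
      - .ccache/
    # pull-push is the default: fetch the cache before the job, upload changes after.
    policy: pull-push
  script:
    - make build

quick-check:
  stage: test
  cache:
    key: ccache-${CI_COMMIT_REF_SLUG}
    paths:
      - .ccache/
    # pull only: reuse the cache but never upload changes back to it.
    policy: pull
  script:
    - make check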
Local Storage
We realize teams may have project CI requirements that do not fit the traditional GitLab mechanisms for managing results. These requirements could include:
Existing infrastructure you maintain locally that can be leveraged as part of a CI pipeline.
Complicated software dependencies that are used across instances of your project and need to be built in advance of a job. These often require hours of compilation and iterate more slowly than the project itself.
Binaries or results that are non-relocatable.
Input or output aspects of the CI process that must be maintained locally to adhere to project policy.
Job artifacts required from a previous stage are multiple gigabytes in size, surpassing existing admin-imposed limitations.
In these sorts of cases, using local storage provided to your project by the facility makes the most sense. That is part of the reason the runner operates as a local user: to allow for seamless access to local storage found on the test resources.
As an example, we use Spack to build Go before executing any tests on our own code. Since this can be a relatively large build and remains consistent across different iterations of our project, storing it in a central location makes sense.
spack-go:
  stage: prepare
  before_script:
    # A script we use to Git the correct release of the Spack repo.
    - source test/scripts/deploy_spack.bash
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/environment
  script:
    - spack install -j 2 go@${GOVERSION}

build-tool:
  stage: build
  before_script:
    - source ${SPACK_ROOT}/spack/share/spack/setup-env.sh
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/alcf
    - spack load go@${GOVERSION}
  script:
    - make build
This simple example looks only at Spack but you could accomplish this with any sort of package/dependency management tool or even with your own scripts.
Note
Spack offers a mechanism to generate GitLab pipelines that can be tied directly into the creation of a build cache. This workflow can be more involved but can dramatically improve your project's flexibility in managing complex software dependencies.
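A rough sketch of that workflow follows, assuming your Spack environment already carries a ci configuration and that your GitLab release supports child pipelines; the job names and output file are illustrative only:

generate-pipeline:
  stage: prepare
  script:
    - spack env activate -d ${CI_PROJECT_DIR}/test/spack/environment
    # Generates a child pipeline definition from the environment's ci configuration.
    - spack ci generate --output-file generated-pipeline.yml
  artifacts:
    paths:
      - generated-pipeline.yml

build-dependencies:
  stage: build
  trigger:
    include:
      - artifact: generated-pipeline.yml
        job: generate-pipeline
    strategy: depend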
Project/Scratch Space
With setuid enhancements and the potential use of service accounts, we want to ensure you have the same access to project/scratch space you are accustomed to during traditional runs. There should be no barriers preventing you from using these spaces however you and your team see fit.
Note
Relying on the after_script as a mechanism to automate cleanup of your local storage is recommended, as it will run regardless of whether the job has failed. However, at this time it will not run if the job was canceled or timed out. There is an issue open to correct this.
Access Between Jobs
Manually maintaining access to files on network storage can be made easier by leveraging predefined variables that are provided to all CI jobs. Something like the CI_PIPELINE_ID will remain consistent throughout an entire pipeline.
job:
  script:
    - mkdir -p /my/project/testing-$CI_PIPELINE_ID
    ...

test:
  script:
    ...
  after_script:
    - rm -rf /my/project/testing-$CI_PIPELINE_ID
Modules
An easy way to improve job times and potentially decrease artifact size is to rely on the available local software environment. For instance, modules:
before_script:
  - ml openmpi hdf5 trilinos petsc

job-one:
  script:
    ...
The obvious tradeoff with modules is the decrease in portability of your jobs. Usually we find this is a worthwhile trade, and in the case of HPC jobs scheduled using the batch executor you will already have to account for some degree of facility-specific environment.
Remote Storage
Remote storage, in our case anything that is not locally managed, is technically a viable target. Remember, anything that can be accomplished with a script can be realized in a CI job. However, you are responsible for adhering to all site policies before transferring data off a system. Just because it is made easier with CI doesn't mean it is acceptable use.
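For instance, pushing results off-system is just another script step; the destination host and paths below are entirely hypothetical, and any transfer like this must be cleared against your site's data policies first:

export-results:
  stage: deploy
  script:
    - make report
    # Hypothetical archive host; confirm this transfer is permitted before enabling it.
    - rsync -av results/ archive-user@archive.example.org:/projects/my-project/${CI_PIPELINE_ID}/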