Deploying and Using Scheduler Actions

Important

This feature is currently only found on the develop branch and is slated for full release with v0.23.0.

One of the largest challenges with the supported batch executors is the decoupling of two distinct schedulers: GitLab pipelines and the underlying HPC system. This central issue can lead to increased queue times for CI jobs and workarounds that involve setting extremely long GitLab timeouts. Unfortunately, there is currently no universal solution to this problem. Instead we wish to provide ways to improve the flexibility of our supported executors and allow for a wider array of workflow integrations without placing additional burdens on already limited scheduled resources.

To understand this feature it's worth reviewing how the existing supported executors operate. Each CI job results in a new, single submission to the underlying scheduler, meaning each job must sit in the queue before it can begin. This time in queue counts towards the CI job's timeout.
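
For reference, a minimal traditional batch job might look like the following sketch (the runner tag and Slurm parameters are purely illustrative). The single scheduler submission it produces must clear the queue before the script runs, all within the CI job's timeout:

build_project:
  tags: [slurm]
  variables:
    SCHEDULER_PARAMETERS: "-N1 --account=example --time=10"
  script:
    - make build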

This is where the new feature comes into play. By building upon the exciting work of the RADIUSS Shared CI project, Jacamar CI will now offer native support for user-defined SCHEDULER_ACTIONS. These actions will allow for a higher degree of project flexibility when interacting with existing, supported executors:

  • allocate: Request resources without running any command/script. The CI job is considered complete as soon as the compute resources have started running.

  • cancel: Close/cancel an existing job.

  • detach: Submit the job to the underlying system and mark the CI job as complete as soon as it has started running on the allocated compute resources.

  • reattach: Use an existing compute resource to execute the proposed CI job script.

This can result in a single GitLab pipeline being linked to just one job submitted to the underlying scheduler:

(Figure: slurm_actions — an example pipeline using Slurm scheduler actions. Support for additional schedulers coming soon.)

Configuring and Deploying

Important

Due to the potential increase in runner utilization (concurrency), this feature must be enabled by the runner administrator.

It is strongly advised that enabling this feature be coupled with an increase in the available runner concurrency to avoid wasted compute cycles from CI jobs stuck waiting on available runners. The exact increase depends on your runner utilization; a good starting place is to double your current value.

Note

If you manage both batch and shell executors from the same configuration file, be aware of the runners.limit configuration.
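
As a sketch of what this could look like in the GitLab Runner config.toml (the names and values below are purely illustrative starting points), the global concurrent value is raised while per-runner limit settings keep a co-located shell executor from consuming every slot:

# Global cap on simultaneous jobs across all runners in this file.
concurrent = 20

[[runners]]
  name = "slurm-batch"
  executor = "custom"
  # Upper bound on jobs handled by this batch runner.
  limit = 16

[[runners]]
  name = "login-shell"
  executor = "custom"
  # Prevent shell executor jobs from starving the batch runner.
  limit = 4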

Once prepared, you simply need to enable this feature in your Jacamar CI configuration:

[batch]
  scheduler_actions = true

Additional Configurations

Substantially increasing the runner concurrency may necessitate additional modifications to the Jacamar CI configuration to improve performance. We do not advise changing these when you initially deploy this feature; gather feedback first, as every instance will have its own unique requirements.

[auth]
  run_stage_allowlist = [
    "prepare_script", "get_sources", "restore_cache",
    "download_artifacts", "build_script", "archive_cache",
    "archive_cache_on_failure", "upload_artifacts_on_success",
    "upload_artifacts_on_failure", "cleanup_file_variables"
  ]

[batch]
  nfs_timeout = "15s"
  command_delay = "30s"

  • auth.run_stage_allowlist: An allowlist of run stages. Preventing stages such as the after_script from running can greatly limit the risk of excessive resource utilization on your job submission node, but may break user workflows.

  • batch.command_delay: This represents the delay between actions that involve the underlying scheduler, for example submitting a job and checking its status.

Monitoring Usage

You will likely need to refine your runner's configuration over time. As such, it might prove valuable to review select projects in order to identify how long their jobs are stuck in queue. Remember, this is the time a job waits for a runner to become available, not the time it waits on the underlying batch scheduler:

#!/usr/bin/env python3
# coding=utf-8

import datetime
import gitlab
import os
import statistics

# Set associated environment variables before running script.
url = os.getenv('GITLAB_SERVER_URL')
project_id = os.getenv('GITLAB_PROJECT_ID')
private_token = os.getenv('GITLAB_TOKEN')
tag = os.getenv('GITLAB_RUNNER_TAG')


gl = gitlab.Gitlab(url=url, private_token=private_token)
proj = gl.projects.get(project_id)

print("Examining job utilization and time waiting for project '{0}' on {1}".format(
    proj.name_with_namespace, url))

tar_end = datetime.datetime.now() - datetime.timedelta(days=7)
jobs = proj.jobs.list(iterator=True)

queued_durations = []

for job in jobs:
    if datetime.datetime.strptime(job.created_at.split('T')[0], '%Y-%m-%d') < tar_end:
        break
    if tag in job.tag_list:
        if job.queued_duration is None:
            queued_durations.append(0)
        else:
            queued_durations.append(job.queued_duration)

print("\tTotal jobs: {}".format(len(queued_durations)))
print("\tMean: {}s".format(round(statistics.mean(queued_durations))))
print("\tMedian: {}s".format(round(statistics.median(queued_durations))))

This script will only examine a single project over the course of a week; your case may differ and you may need to modify this example. We've chosen to use the python-gitlab package for simplicity; however, the GitLab API itself is fully documented and available should you prefer to use it directly.
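
As a quick usage sketch, the script above could be saved as queue_times.py (a hypothetical name) and driven entirely by the environment variables it reads:

export GITLAB_SERVER_URL="https://gitlab.example.com"
export GITLAB_PROJECT_ID="1234"
export GITLAB_TOKEN="<token with read access to the project>"
export GITLAB_RUNNER_TAG="slurm"

pip install python-gitlab
python3 queue_times.py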

Utilizing Feature in CI/CD Pipelines

Each SCHEDULER_ACTION value maps to a command on the underlying scheduler (currently Slurm):

  SCHEDULER_ACTION    Slurm
  ----------------    -------
  default             sbatch
  allocate            salloc
  cancel              scancel
  detach              sbatch
  reattach            srun

Basic Workflow

Running a basic workflow is the best way to start with this feature.

stages:
  - run

variables:
  # Allocating and re-attaching to jobs requires a unique key. When not supplied the
  # default "ci-build-${CI_PIPELINE_ID}" value will be used.
  SCHEDULER_JOB_KEY: ci-build-${CI_PIPELINE_ID}

# First begin by claiming a compute resource; this job will wait until the requested
# allocation is in a running state before ending.
request_allocation:
  stage: .pre
  variables:
    # In this job we will be submitting our request using 'salloc'.
    SCHEDULER_ACTION: allocate
    SCHEDULER_PARAMETERS: "-N1 --account=example --time=2"
  script:
    # The 'script' keyword is required by GitLab to have a valid CI YAML file. However,
    # in this case we will not run the resulting script so the contents do not matter.
    - placeholder
  # The 'after_script' will execute as you might traditionally expect; however, be advised
  # that subsequent jobs in your pipeline will be stuck waiting until completion.
  after_script:
    - example..

# Reattach to the previously claimed allocation; each job will run as
# an individual task on the requested node.
use_allocation:
  stage: run
  parallel: 3
  variables:
    # Each job will run within the previously allocated resource using 'srun'.
    SCHEDULER_ACTION: reattach
    SCHEDULER_PARAMETERS: "--time=1"
  script:
    - hostname

# Finally we have to ensure the requested allocation is canceled.
close_allocation:
  stage: .post
  variables:
    SCHEDULER_ACTION: cancel
  rules:
    # Ensuring that the job always runs is important for avoiding
    # cases where an allocation remains inadvertently running and wasting
    # your project's available cycles.
    - when: always
  script:
    # Any before_script+script will not be run.
    - placeholder

Detaching Job

Detaching a job is a simple process that acts almost identically to a traditional job.

submit_job:
  script:
    - make run
  variables:
    SCHEDULER_PARAMETERS: "-N4 --account=example --time=60"
    SCHEDULER_ACTION: detach
    # We can choose to define a key for the job if it helps other external
    # workflows, or rely on the default.
    # SCHEDULER_JOB_KEY: ci-build-${CI_PIPELINE_ID}
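
Since a detached job continues running after its CI job completes, you may also want a way to stop it early. One option, sketched below by reusing the cancel action, is a manual job keyed to the same SCHEDULER_JOB_KEY:

stop_detached_job:
  when: manual
  variables:
    SCHEDULER_ACTION: cancel
    # Must match the detached job's key; here the default is assumed.
    # SCHEDULER_JOB_KEY: ci-build-${CI_PIPELINE_ID}
  script:
    # As with other cancel jobs, the script contents will not be run.
    - placeholder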

Spack Pipeline

Spack CI Pipelines offer a mechanism by which you can generate a GitLab CI pipeline (.gitlab-ci.yml) based upon a valid Spack Environment (spack.yaml). You could use SCHEDULER_ACTION within this structure to claim an allocation and ensure each individual package build occurs on the compute resource.

spack:
  view: false
  concretizer:
    unify: false

  specs:
    - raja

  mirrors: { "mirror": "file:///example/project/cache" }

  ci:
    enable-artifacts-buildcache: False
    rebuild-index: False
    pipeline-gen:
    - any-job:
        tags: [slurm]
        before_script:
          - source /example/project/spack/share/spack/setup-env.sh
        id_tokens:
          CI_JOB_JWT:
            aud: https://gitlab.example.com
        variables:
          SCHEDULER_PARAMETERS: "--exclusive --time=20"
          SCHEDULER_ACTION: reattach

The associated .gitlab-ci.yml generates the child pipeline, claims the allocation, triggers the generated build jobs, and finally releases the resource:

stages:
  - generate
  - prep
  - build

variables:
  SCHEDULER_JOB_KEY: ci-build-${CI_PIPELINE_ID}

default:
  tags: [slurm]
  id_tokens:
    CI_JOB_JWT:
      aud: https://gitlab.example.com

generate-pipeline:
  stage: generate
  tags: [shell]
  before_script:
    - source /example/project/spack/share/spack/setup-env.sh
  script:
    - spack env activate --without-view .
    - spack -d ci generate --artifacts-root "${CI_PROJECT_DIR}/jobs_scratch_dir" --output-file "${CI_PROJECT_DIR}/jobs_scratch_dir/pipeline.yml"
  artifacts:
    paths:
      - jobs_scratch_dir

claim-resource:
  stage: prep
  script:
    - placeholder
  variables:
    SCHEDULER_PARAMETERS: "-N1 --partition=example --time=60"
    SCHEDULER_ACTION: allocate

build-jobs:
  stage: build
  trigger:
    forward:
      # We need to ensure that our SCHEDULER_JOB_KEY variable is surfaced
      # in the child pipeline.
      pipeline_variables: true
    include:
      - artifact: "jobs_scratch_dir/pipeline.yml"
        job: generate-pipeline
    strategy: depend

close-resources:
  stage: .post
  script:
    - placeholder
  variables:
    SCHEDULER_ACTION: cancel
  rules:
    - when: always

We encourage you to review the feature in the official Spack documentation; this example is simply meant to highlight a potential workflow that could benefit from SCHEDULER_ACTIONS.

Dynamic Parameters

There are cases when working with SCHEDULER_PARAMETERS where you might need to introduce runtime decisions. For example, depending on the time of day you might change the scope of requested resources, or based upon build requirements request additional nodes. Simply referring to a script in the SCHEDULER_PARAMETERS variable will allow it to run, and based upon the JSON it returns, decisions can be made about the job submission:

variables:
  SCHEDULER_PARAMETERS: "${CI_PROJECT_DIR}/scheduler.bash"
#!/bin/bash

# args - Equivalent to traditional SCHEDULER_PARAMETERS.
# skip - Indicates if a job should be completed with exit status 0, without submitting a job script.

echo '{"args":"-N1 --account example --queue ci","skip":false}'
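
As a sketch of the runtime decisions mentioned above (the schedule and parameters are purely illustrative), the script could branch before emitting its JSON, for example requesting more nodes overnight and skipping submission entirely on weekends:

#!/bin/bash

# Illustrative only: adjust the request based on the time of day/week.
hour=$(date +%H)
day=$(date +%u)

if [ "$day" -ge 6 ]; then
  # Weekend: complete the CI job without submitting anything.
  echo '{"args":"","skip":true}'
elif [ "$hour" -ge 18 ] || [ "$hour" -lt 6 ]; then
  # Overnight: request a larger allocation.
  echo '{"args":"-N4 --account example --time=30","skip":false}'
else
  echo '{"args":"-N1 --account example --time=30","skip":false}'
fi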

Mitigating Wasted Cycles

Note

As part of the support for this feature we are always interested in additional ways to lower the risk of wasting allocated resources. Feedback and ideas are welcome.

A primary risk of using this feature is that decoupling a requested resource from the original pipeline can lead to wasted cycles and underutilization. To avoid this, there are several recommendations you should consider when deploying your project's pipeline.

  • Always ensure that the cancel action is configured to run in your pipeline.

  • Using the GitLab Web UI to cancel jobs/pipelines, or the interruptible keyword, is advised as either will result in the allocation being properly released.

  • Set a reasonable wall-clock limit for your allocation and all associated actions.

  • Avoid time-consuming actions such as uploading large artifacts, caching, or similar steps that could be easily accomplished on less costly resources.

  • Leverage needs in your pipeline in order to realize DAGs and avoid cases where jobs are limited by the traditional stages, as shown in the sketch below.
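
As a sketch of that last point (job and script names are illustrative), needs lets jobs begin as soon as the allocation exists rather than waiting for an entire stage to finish:

stages: [prep, run]

request_allocation:
  stage: prep
  variables:
    SCHEDULER_ACTION: allocate
    SCHEDULER_PARAMETERS: "-N1 --time=30"
  script:
    - placeholder

short_tests:
  stage: run
  # Starts as soon as the allocation job succeeds, independent of other jobs.
  needs: [request_allocation]
  variables:
    SCHEDULER_ACTION: reattach
  script:
    - ./run_short_tests.sh

long_tests:
  stage: run
  needs: [request_allocation]
  variables:
    SCHEDULER_ACTION: reattach
  script:
    - ./run_long_tests.sh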