Configurations

All aspects of the Jacamar CI configuration are documented here, along with key elements of the GitLab Runner configuration that relate either to the custom executor or to general concerns. In both cases the configuration is managed in the TOML format.

Jacamar CI Config

Due to the highly configurable nature of Jacamar CI, it requires its own configuration file. If you are new to this application, we recommend that you first review the admin tutorial.

Unless explicitly noted, there are no default string/integer values, and unset booleans will be false.

[general] - Table

Key

Description

executor

A required setting that specifies which of the supported executors CI jobs will utilize.

data_dir

A required setting where all files/directories for a job are stored. Strict ownership (user:user) and permissions (0700) are enforced on top level directories.

retain_logs

Keep all files generated by the executor and/or scheduling mechanisms (default: false, removed upon job completion).

custom_build_dir

Observe the directory specified by a user’s CUSTOM_CI_BUILDS_DIR variable; this will replace any builds directory derived from the data_dir. Jacamar will ensure unique paths and appropriate permissions (0700). Does not function in conjunction with root_dir_creation.

name

Administrator defined string that (currently) only appears in Jacamar’s system logging, to help distinguish this instance from others.

kill_timeout

Maximum timeout (duration string, default: 120s) the Jacamar-Auth application will wait before sending a SIGKILL to the underlying Jacamar process after a SIGTERM is captured from a custom executor termination.

shell_path

Path to the shell used when constructing the Bash shell for job script execution. When not set, it is resolved from PATH.

job_message

Custom message that will be conveyed at the start of every prepare_exec stage to the user.

gitlab_server

Trusted URL for the GitLab server used in all web interactions. It takes priority over any values identified in the job response.

tls-ca-file

File containing the required certificates for HTTPS actions.

unrestricted_cmd_line

Allow for unfettered usages of tokens via the command line by all runner generated job scripts.

static_builds_dir

Create a static folder (<data_dir>/user/static/<job_id>) that will be unique for every job and only removed based upon the static_min_days configuration.

static_min_days

Minimum number of days any static directory can remain (default: 7, with 0 indicating no cleanup required). This will only be enforced during job cleanup and may lead to longer than average job durations.

group_permissions

Set base permissions on Jacamar generated data directories to allow read and execute access for groups (i.e., 0750 permission).

jwt_env_variable

Environment variable to be checked for an id_token (default: CI_JOB_JWT).

set_stack_size

Overrides behavior found in RHEL 8 that reverts the stack size to 8m when capabilities are used. This will cause the user’s environment to set the ulimit to match the available hard limit, normally configured through the runner’s systemd file.

user_bash_env

Can be used to define key=value pairs that will be injected into the user’s shell responsible for job script and monitoring command execution.

The [general] configuration is applicable to a range of features and can affect both the jacamar-auth and jacamar applications.

[general]
  executor = "shell"
  data_dir = "/ecp"
  retain_logs = true
  custom_build_dir = false
  name = "My Jacamar Driver"
  kill_timeout = "120s"
  job_message = """
  ****************************************************************************
                        NOTICE TO USERS

  This is an example message ....
  ****************************************************************************
  """
  gitlab_server = "https://gitlab.example.com"
  tls-ca-file = "/example/file.crt"
  unrestricted_cmd_line = false
  static_builds_dir = false
  static_min_days = 7
  group_permissions = false
  jwt_env_variable = "SITE_ID_TOKEN"
  user_bash_env = ["EXAMPLE=var"]

data_dir - Required

[general]
  data_dir = "/ecp"

The data_dir is used as the base directory where all required build, cache, and script related contents are stored. Unlike a traditional runner builds/cache directory the data_dir will seek to enforce user ownership over the files by establishing a top level 700 permission directory by default, although this can also be configured to allow for a 750 permission directory if group access is desired. Additionally, if set to data_dir = "$HOME", jobs will be stored in the user specific home directory.

It should be noted that upon changing the group_permissions field in the configuration, as well as the accompanying feature flag, the data_dir will be deleted and recreated as a completely new directory with the appropriate permissions.

The easiest way to understand the data_dir setting is to examine its effect. In our example /ecp is the base directory, and immediately following it is the individual folder of the local user responsible for triggering the job:

$ namei -l $(pwd)
f: /ecp/user/builds/runnerShort/000/group/project
    drwxr-xr-x root    root     /
    drwx-----x root    root     ecp
    drwx------ user    user     user
    drwx------ user    user     builds
    drwxr-xr-x user    user     runnerShort
    drwxr-xr-x user    user     000
    drwxr-xr-x user    user     group
    drwxr-xr-x user    user     project

Jacamar is responsible for generating the /user/{builds/cache/script} directories. It is important to note that this is the responsibility of jacamar, not jacamar-auth. The authorization process will identify all required directory paths, but it becomes the responsibility of the user-owned process to create them. Strict rules regarding ownership and permissions are enforced for this directory creation process.

Important

It is not required to allow the user to generate their own base directory (/ecp/user in our example structure), in fact we understand it is desirable to have an administrative process create these automatically. Just ensure that the folder has proper ownership (user:user) and permissions (0700) or else the job will fail.
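As a rough illustration of the ownership and permission rule above, an administrative pre-creation process could verify a user's base directory before any job runs. The Python below is a minimal sketch under stated assumptions, not Jacamar's own check: the function name check_base_dir and its expected_mode parameter are hypothetical.

```python
import os
import stat


def check_base_dir(path, uid, gid, expected_mode=0o700):
    """Return a list of problems found with a user's base directory.

    Hypothetical helper: an empty list means the directory has the
    expected owner, group, and permission bits.
    """
    problems = []
    info = os.stat(path)
    if not stat.S_ISDIR(info.st_mode):
        problems.append("not a directory")
    if (info.st_uid, info.st_gid) != (uid, gid):
        problems.append(
            "owner is %d:%d, expected %d:%d"
            % (info.st_uid, info.st_gid, uid, gid)
        )
    # S_IMODE strips the file type and keeps only the permission bits.
    mode = stat.S_IMODE(info.st_mode)
    if mode != expected_mode:
        problems.append("mode is %04o, expected %04o" % (mode, expected_mode))
    return problems
```

For example, check_base_dir("/ecp/user", uid, gid) would be called with the numeric IDs of the CI user. Passing expected_mode=0o750 would correspond to a deployment that has enabled group_permissions.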

Note

When choosing a data_dir for an HPC scheduler executor type, verify that the volume is mounted in the same way on the runner host system as it is on any of the available compute resources (source issue#155).

limit_build_dir

An issue with the traditional data_dir can be observed in the structure of the builds_dir, specifically the inclusion of the runner short token in the generated path. When managing a small pool of runners that share the same data_dir this doesn’t present a major issue. However, the multiplying effect these folders have can quickly become apparent as you scale up the number of runners across machines/clusters.

To help address this the limit_build_dir has been introduced to offer a solution that avoids runner specific folders while still meeting the requirement that each CI job executes in its own unique directory. It is best to highlight this in action:

[general]
  data_dir = "/store/$USER/ci"
  limit_build_dir = true

Once enabled, our process will utilize the data_dir and observe all rules surrounding permissions, but instead of constructing a standardized path it will claim the next available concurrent directory using fcntl.

The resulting build directory will be created using the project name coupled with the ID and within that structure concurrent folders are managed:

$ pwd
/store/username/ci/builds/project-name_uniqueID

$ ls -a
.000.lock  .001.lock  .002.lock 000  001  002

The Jacamar application will utilize and observe these lock files during initial configuration in order to claim a concurrent ID regardless of the runner.

$ cat .004.lock | jq
{
  "job_id": "2424067",
  "expiration": "1714066690",
  "hostname": "example"
}
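The claiming process described above can be sketched in Python. This is an illustrative model only and not Jacamar's implementation: the function name claim_concurrent_dir, the lifetime_s parameter, and the rule that a lock file may be reclaimed once its recorded expiration has passed are assumptions inferred from the lock file contents shown above.

```python
import fcntl
import json
import os
import socket
import time


def claim_concurrent_dir(project_dir, job_id, lifetime_s, limit=1000):
    """Claim the first free concurrent directory and return its index."""
    os.makedirs(project_dir, exist_ok=True)
    now = int(time.time())
    for idx in range(limit):
        lock_path = os.path.join(project_dir, ".%03d.lock" % idx)
        fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            try:
                # Non-blocking exclusive lock; fails if another process holds it.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue
            raw = os.read(fd, 4096)
            # Assumption: an unexpired record means the slot is still in use.
            if raw and int(json.loads(raw)["expiration"]) > now:
                continue
            record = {
                "job_id": str(job_id),
                "expiration": str(now + lifetime_s),
                "hostname": socket.gethostname(),
            }
            os.ftruncate(fd, 0)
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, json.dumps(record).encode())
            os.makedirs(os.path.join(project_dir, "%03d" % idx), exist_ok=True)
            return idx
        finally:
            os.close(fd)  # closing the descriptor also releases the lock
    raise RuntimeError("no concurrent directory available")
```

In this sketch the fcntl lock is only held while a lock file is inspected and rewritten; afterwards the recorded expiration is what keeps other jobs away from the directory. Note that POSIX record locks are held per process, so they only arbitrate between separate processes.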

To the average user this won’t present any changes to their workflows; however, it does greatly alter the structure of the <data_dir>/builds directory. We strongly advise that if you utilize this feature you remove all existing build directories and start from scratch.

Key

Description

limit_build_dir

Enforces a limited structure on the builds_dir by creating a user driven process to automatically claim concurrent directories through file locking.

max_build_dir

Indicates how many concurrent build directories can be left on the system (default: 0, only limited by cumulative runner concurrency).

uncap_build_dir_cleanup

By default cleanup is limited to a single builds_dir in every job. This reduces the chance of a CI job becoming “stuck” during clean_exec, during which we lack the ability to directly notify the user of any cleanup actions.

file_lock_debug

Create a log file that outlines all actions of the jacamar lock process occurring in userspace. This should only be used for troubleshooting potential errors with the process of generating/claiming file locks as there is no automated cleanup on these files.

user_enabled_limit

Only a user (via the JACAMAR_LIMITED_DIR: 1 variable) can utilize this feature. The ideal workflow would have this enabled in conjunction with the primary limit_build_dir to allow select users to test this feature at scale with existing infrastructure.

[general]
  limit_build_dir = true
  max_build_dir = 0
  uncap_build_dir_cleanup = false
  file_lock_debug = false
  user_enabled_limit = false

When deploying and testing this feature for the first time it may prove beneficial to enable the file_lock_debug option. This results in a folder (lock_debug/) appearing alongside the concurrent directories and lock files. Within it, each job will have a file with details regarding the file locking process.

$ cat 2424067.json
{"level":"info","msg":"unable to lock file 0: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:49Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.001.lock has not expired","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"unable to lock file 2: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.003.lock has not expired","time":"2024-04-25T16:07:57Z"}
{"level":"info","msg":"file claimed with 1714066690 expiration on ci-test-2 host","time":"2024-04-25T16:08:10Z"}
{"level":"info","msg":"identified concurrent target: 004","time":"2024-04-25T16:08:15Z"}

Be aware that this should only be used for testing/debugging purposes. It can also be coupled with the user_enabled_limit option, which restricts use of the feature to projects that explicitly opt in. This provides a way to experiment at your scale without having to manage an additional set of deployments.

unrestricted_cmd_line

Important

We strongly advise only enabling this option when you know /proc has been mounted with hidepid, or else you will increase the risk of runner generated scripts exposing CI_JOB_TOKEN via the command line.

[general]
  unrestricted_cmd_line = true

By default jacamar-auth takes steps to avoid cases where a job token could end up in runner defined scripts (e.g., when using Git or managing artifacts). This includes augmenting the runner generated scripts and leveraging GIT_ASKPASS. Coupled with the Git credentials script, we also have to restrict the use of the credential store to avoid breakage caused by incorrectly storing the CI_JOB_TOKEN.

All of this is only required if there is a chance that a script the user cannot control could expose their job token in /proc. By default we do not plan to modify this behavior; however, those who have decided to hide PID listings can enable this setting.

One final note: this does not protect user generated scripts/actions, and users should always follow the best practices for your machine when working on multi-tenant resources.

Signal Management

The GitLab custom executor allows for configurable durations on timeouts, most importantly the graceful_kill_timeout defaults to 10 minutes. This means that once a job is canceled jacamar-auth will have this time to gracefully terminate whatever processes it is currently running. However, due to the range of potential configurations relating to downscoping, jacamar-auth enforces its own separate timeout on the jacamar sub-process:

[general]
  kill_timeout = "120s"

The kill_timeout starts once jacamar-auth has intercepted a terminating signal generated by the runner and passed it on to jacamar. Only once this timeout has elapsed will SIGKILL be sent.
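The SIGTERM-then-SIGKILL sequence can be illustrated with a short Python sketch. It only models the general pattern for a directly owned child process and is not how jacamar-auth is implemented (which, as noted in this section, must also account for downscoped permissions); the function name stop_process is hypothetical and the timeout is given in seconds rather than as a duration string.

```python
import signal
import subprocess


def stop_process(proc, kill_timeout_s):
    """Forward SIGTERM, wait up to kill_timeout_s seconds, then SIGKILL.

    proc is a subprocess.Popen object; the return value is its exit status.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=kill_timeout_s)
    except subprocess.TimeoutExpired:
        # The child did not exit in time; SIGKILL cannot be caught or ignored.
        proc.kill()
        return proc.wait()
```

On POSIX systems a negative return value reports the signal that ended the child, so a process that ignores SIGTERM ends with -9 (SIGKILL) while a cooperative one ends with -15 (SIGTERM) or its own exit code.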

Important

Never set your runner’s graceful_kill_timeout configuration below that of Jacamar CI’s kill_timeout. In cases where the runner user downscopes permissions the jacamar-auth application takes special steps to ensure that the appropriate signal reaches the sub-process as permissions will likely prohibit a simple kill(2).
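A deployment check for the rule above could look like the following Python sketch. It assumes that kill_timeout uses Go-style duration strings (for example "120s" or "1m30s") and that graceful_kill_timeout is expressed in seconds, as in the runner example later on this page; both function names are hypothetical and only a subset of units is handled.

```python
import re

# Seconds per unit for the subset of duration units handled here.
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)"


def parse_duration(text):
    """Convert a duration string such as "120s" or "1m30s" into seconds."""
    if not re.fullmatch("(?:%s)+" % _PART, text):
        raise ValueError("unsupported duration string: %r" % text)
    return sum(float(num) * _UNITS[unit] for num, unit in re.findall(_PART, text))


def timeouts_compatible(graceful_kill_timeout, kill_timeout):
    """True when the runner's timeout (seconds) is not below kill_timeout."""
    return graceful_kill_timeout >= parse_duration(kill_timeout)
```

With the values used on this page, timeouts_compatible(600, "120s") holds, whereas a runner configured with graceful_kill_timeout = 60 would fail the check.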

As a backup, any Jacamar CI application will also establish a self-imposed timeout of the job’s maximum duration plus 10 minutes.

[auth] - Table

Key

Description

downscope

Target downscoping mechanisms for execution of all CI scripts and generated commands through the auth mechanisms. When using jacamar-auth this is required.

jacamar_path

The full path to the Jacamar application, used in constructing the command for job execution. This can be used if it has been installed outside the user’s PATH.

max_env_chars

The maximum number of characters that can be defined per environment variable (default: 10000).

lists_pre_validation

Boolean indicating whether the allow/block list rules should be observed prior to the execution of the RunAs validation script.

root_dir_creation

Indicate via boolean if the privileged Jacamar-Auth user should create the target CI user’s base data_dir (e.g., /data_dir/username) and assign permission via chown.

user_allowlist

An authoritative list of users who can execute CI jobs.

user_blocklist

A list of usernames that are not allowed to run CI jobs. More authoritative than group lists, but can be overridden by user_allowlist.

groups_allowlist

A list of groups that are allowed to run CI jobs. Least authoritative.

groups_blocklist

A list of groups that are not allowed to run CI jobs.

shell_allowlist

If defined, an authoritative list of acceptable shells for CI users, as they are found in the user database.

pipeline_source_allowlist

If defined, an authoritative list of acceptable CI_PIPELINE_SOURCE values that can result in local jobs. The value is obtained through the verified GitLab JWT.

jwt_exp_delay

Configurable duration string delay allowed in a JWT’s expiration in select cases to allow for automated cleanup actions (default 15m and maximum 1hr).

jwt_required_aud

Required audience (aud) when validating a JWT.

allow_bot_accounts

GitLab managed project bot accounts (i.e., project_{number}_bot) are disallowed by default.

no_new_privs

Enforces PR_SET_NO_NEW_PRIVS, to limit the sub-process from gaining additional privileges. Please note that this setting is redundant if seccomp is being used.

run_stage_allowlist

List of Run stages that are allowed; all others are skipped with a warning to the user.

enforce_nologin

Indicates that jobs should be blocked during configuration if a pam_nologin (https://man7.org/linux/man-pages/man8/pam_nologin.8.html) file (/etc/nologin or /var/run/nologin) is encountered. The contents of this file will be presented to the user in their job log.

[auth] represents the authorization process configuration for approving any GitLab and local accounts. It is only observed by, and made available to, the jacamar-auth application. For more details see the Authorization via Jacamar-Auth documentation.

[auth]
  downscope = "setuid"
  jacamar_path = "/custom/bin"
  max_env_chars = 10000
  lists_pre_validation = false
  root_dir_creation = true
  allow_bot_accounts = false
  jwt_exp_delay = "5m"
  no_new_privs = false
  enforce_nologin = true

  user_allowlist = ["usr1"]
  user_blocklist = ["usr2", "usr3"]
  groups_allowlist = ["grp1", "grp2"]
  groups_blocklist = ["grp3"]
  shell_allowlist = ["/bin/bash"]
  pipeline_source_allowlist = ["push", "web"]
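The relative authority of the four user and group lists described in the table above can be modeled as an ordered series of checks. The Python below is a hedged sketch rather than Jacamar's logic: the function name is_authorized is hypothetical, the ordering of groups_blocklist ahead of groups_allowlist is inferred from the latter being described as least authoritative, and the fallback when nothing matches is an assumption.

```python
def is_authorized(user, groups, auth):
    """Evaluate allow/block lists from most to least authoritative.

    auth is a dict shaped like the [auth] table; groups is the list of
    local groups the user belongs to.
    """
    if user in auth.get("user_allowlist", []):
        return True  # most authoritative, overrides every block list
    if user in auth.get("user_blocklist", []):
        return False  # outranks both group lists
    if set(groups) & set(auth.get("groups_blocklist", [])):
        return False
    if set(groups) & set(auth.get("groups_allowlist", [])):
        return True  # least authoritative
    # Assumption: with no match, allow only if no allowlist is configured.
    return not (auth.get("user_allowlist") or auth.get("groups_allowlist"))
```

Using the example configuration, usr1 would be accepted even as a member of grp3, while usr2 would be rejected even as a member of grp1.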

[auth.runas] - Table

Key

Description

validation_script

Specify the path to a script where the local user and target service account can be validated. When using RunAs a script is required.

user_variable

Indicates the name of the CI variable a user can define to indicate their target service account.

sha256

Checksum of the script; if provided, it will be verified shortly before execution.

validation_env

Manages a list of “key=value” strings that dictate additional context to the validation script. These will take lowest priority so avoid using the key for any existing RunAs or system environment variables.

Configuration of the RunAs portion of the authorization flow can offer administrative control over a transition between the CI user and a local account not known by GitLab. For additional details and workflow consideration see the RunAs authorization.

[auth.runas]
  validation_script = "/custom/run-validate.py"
  user_variable = "TARGET_SERVICE_USER"
  sha256 = "e258d248fda94c63753607f7c4494ee0fcbe92f1a76bfdac795c9d84101eb317"
  validation_env = ["DEBUG_ENV=1"]
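To illustrate what verifying the sha256 value involves, the following Python sketch hashes a file and compares the result with the configured checksum. It is a generic example using the standard hashlib module, not the code jacamar-auth runs; the function name script_matches_checksum is hypothetical.

```python
import hashlib


def script_matches_checksum(path, expected_sha256):
    """Return True when the file's SHA-256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large scripts are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.strip().lower()
```

The value for the configuration file can be produced with any SHA-256 tool (for example sha256sum on Linux) and should be regenerated whenever the validation script changes.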

[auth.logging] - Table

Key

Description

enabled

Whether system logging for jacamar-auth should be used for all CI jobs that are processed.

location

Identifies where logs will be saved; this can be a distinct file or syslog (default). In the case of syslog, a connection to the log daemon will be established, targeting the local syslog server if related values are not specified.

level

Denotes the logging level (error, warn, info, or debug) of messages saved. Defaults to debug.

network

Used for dialing remote log daemon connections only (e.g., tcp).

address

Used for dialing remote log daemon connections only (e.g., localhost:1234).

Logging represents configuration of how the jacamar-auth application (ONLY) will log relevant job level information. This occurs in addition to any logging performed by the GitLab runner and assumes that the user account responsible for launching jacamar-auth is provided with the necessary access to the local system log daemon or target file.

[auth.logging]
  enabled = true
  location = "syslog"
  level = "debug"

Note

Incorrectly configured logging will result in CI job failures during the initial configuration stage. Please be sure to test/verify any related configuration changes prior to deployment.

[auth.seccomp] - Table

Key

Description

disabled

Signal whether system call filtering via libseccomp should be disabled. This includes all system defined defaults as well as administrative configurations. We advise only disabling it when troubleshooting or under specific circumstances where security requirements are not as high.

block_calls

A list of blocked system calls that the jacamar-auth application will declare. Incorrectly defined calls will result in an error message being produced immediately upon job creation and should be troubleshooted prior to deployment of any configuration changes.

block_all

Globally blocks all system calls from being used. This requires reliance on a manually defined list of allow_calls for functionality.

allow_calls

List of system calls that will be allowed. This takes precedence over any manually (block_calls) or system defined blocked calls.

log_allowed_actions

Sets the default action for allowed system calls to log (audit) while still allowing their execution. This option creates a substantial number of logs and is only suited for dev/test environments.

disable_no_new_privs

Disables or prevents the application of PR_SET_NO_NEW_PRIVS based upon the usage of seccomp filters. This only applies when seccomp is enabled.

error_num_block_actions

Modifies the desired block actions and will return an error code rather than terminating the associated thread.

validation_plugin

Path to a Go plugin where the filter can be modified. Setting this value implies that plugin support should be enabled.

The jacamar-auth application by default supports system call filtering through the libseccomp API. This added functionality can be found in versions 0.5.0+ of Jacamar CI. There are two distinct mechanisms by which specific syscalls are identified for filtering: administratively defined configurations and Default Filters established based upon supported downscoping mechanisms.

Note

Due to the nature of Jacamar CI’s architecture, not all potential issues that are present in interactive applications are found here. However, if you have concerns or recommendations, we encourage you to create a security issue for the Jacamar CI project.

[auth.seccomp]
  disabled = false
  block_calls = ["sethostname", "sendfile"]
  log_allowed_actions = false
  disable_no_new_privs = false

Block All By Default

Note

We do not currently have documented support for the known list of syscalls that must be allowed to support basic application functionality. Please use this for testing purposes only at this time.

The block_all option establishes a default filtering mechanism that blocks all syscalls regardless of potential conditions. This optional configuration will necessitate an administrator providing a list of allowable calls, otherwise every job will fail.

[auth.seccomp]
  block_all = true
  allow_calls = ["read", "write", "..."]
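The way disabled, allow_calls, block_calls, and block_all interact, as described in the table above, can be summarized with a small decision function. This Python sketch is a simplified model for reasoning about a configuration; it does not use libseccomp, ignores system defined defaults and the optional filters, and the function name syscall_allowed is hypothetical.

```python
def syscall_allowed(name, seccomp):
    """Model whether a syscall would pass the administratively defined filter.

    seccomp is a dict shaped like the [auth.seccomp] table.
    """
    if seccomp.get("disabled", False):
        return True  # filtering is turned off entirely
    if name in seccomp.get("allow_calls", []):
        return True  # allow_calls takes precedence over any block
    if name in seccomp.get("block_calls", []):
        return False
    # With block_all every call not explicitly allowed is blocked.
    return not seccomp.get("block_all", False)
```

For instance, with block_all = true and allow_calls = ["read", "write"], the model accepts read but rejects open, which is why an incomplete allow list causes every job to fail.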

Default Filters

The application will attempt to define a meaningful yet limited set of default filters for select syscalls.

Note

Modifications planned for v0.12.0+ removed the remaining default filter, thus disabling seccomp by default for many deployments (see MR 351). For deployments utilizing a standard workflow from systemd/service this will be the equivalent of disabling seccomp with your current deployed version.

Optional Filters

Optionally enabled filters that can be configured in the [auth.seccomp] table.

Configuration

Filter Description

limit_setuid

Block any setuid or setgid call to the non-authorized UID/GID.

tty_rules

Block ioctl in conjunction with TIOCSTI.

[auth.seccomp]
  limit_setuid = true
  tty_rules = true

[batch] - Table

Key

Description

arguments_variable

An array of potential CI variables for user provided arguments in the job submission that are checked in order (default SCHEDULER_PARAMETERS is always present as a catch all).

command_delay

Meter interactions with schedulers via a duration string (default: 30s). We recommend leaving this at its default value unless specific concerns with your environment arise.

nfs_timeout

Largest possible delay to expect from NFS servers as a duration string (default: 30s). Due to the batch executors’ reliance on compute resources coupled with a network file system, providing too low a value can lead to job results not being correctly conveyed to the user.

scheduler_bin

Path to be observed as a prefix for all scheduler commands generated. Useful when the default scheduler application found on a user’s PATH may be incorrect.

env_vars

Array of key=value strings that are used when building the job submission command (e.g., qsub).

allow_illegal_args

Do not cause job failures when conflicting parameters are encountered.

skip_cobalt_log

Identify that the job status found in the CobaltLog should be skipped in favor of an echo in the output file (Ideally for test/debug purposes only).

lsf_job_cancellation

Enables the use of bkill to signal a running job it’s time to stop based upon a runner generated SIGTERM.

default_args

List of arguments that will be injected into the job submission commands.

disable_name_prefix

Prevents a user defined name prefix (via SCHEDULER_JOB_PREFIX).

Configurations relating exclusively to the supported batch scheduling systems (Cobalt, Flux, LSF, PBS, and Slurm). These will only be observed when a related executor is configured.

[batch]
  arguments_variable = [
      "NEW_SITE_PARAMETERS", "OLD_SITE_PARAMETERS"
  ]
  command_delay = "30s"
  nfs_timeout = "1m"
  scheduler_bin = "/usr/scheduler/bin"
  env_vars = [
      "GPU_ENABLED=true", "EXAMPLE_MODE=debug"
  ]
  allow_illegal_args = false
  lsf_job_cancellation = false
  default_args = ["--clusters=example"]
  disable_name_prefix = false
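The ordered lookup described for arguments_variable can be pictured with the following Python sketch. It is an interpretation rather than Jacamar's code: the function name resolve_scheduler_arguments is hypothetical, and the rule that the first non-empty variable wins is an assumption based on the variables being "checked in order" with SCHEDULER_PARAMETERS as the catch all.

```python
def resolve_scheduler_arguments(ci_variables, arguments_variable):
    """Return (variable_name, value) for the first variable holding arguments.

    ci_variables is a dict of CI variables from the job; arguments_variable
    is the configured list. SCHEDULER_PARAMETERS is always checked last.
    """
    candidates = list(arguments_variable) + ["SCHEDULER_PARAMETERS"]
    for name in candidates:
        value = ci_variables.get(name, "")
        if value.strip():
            return name, value
    return None, ""
```

With the example configuration, a job defining both OLD_SITE_PARAMETERS and SCHEDULER_PARAMETERS would, under this assumption, have its arguments taken from OLD_SITE_PARAMETERS.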

Feature Flags

Important

These optional configurations are primarily meant for testing/feedback and are subject to modification.

Table

Key

Description

[general]

ff_custom_data_dir

Allow users to specify their own data_dir via CI variables, supersedes custom_build_dir.

[batch]

ff_slurm_sacct

Enable secondary check using sacct to verify COMPLETED exit state upon job completion.

[batch]

ff_user_args

Improve shell quoting and reliability when generating job submission commands.

GitLab Runner Config

GitLab has organized documentation covering a number of topics relating to configuring a GitLab runner that we highly recommend you review. Details provided here are focused on aspects that relate directly to Jacamar or other HPC focused administration concerns.

# global
concurrent = 5

# runner specific
[[runners]]
  ...
  pre_clone_script = '''
    ml use /example/modules/Core && ml git
  '''
  output_limit = 10000
  executor = "custom"
  [runners.custom]
    config_exec = "/opt/jacamar/bin/jacamar-auth"
    config_args = ["config", "--configuration", "/jacamar.toml"]
    ...
    graceful_kill_timeout = 600

Beyond correctly configuring the custom executor there are other aspects of the runner’s config that are worth closer examination.

concurrent

concurrent = 5

A single config.toml can define multiple runners that can be registered with GitLab. Each appears under a separate [[runners]] table. Regardless of the number of registered runners, there is an upper limit, defined by concurrent, on the total number of jobs that can be running at any given time.

In our above example the limit is set to 5; however, as an admin you can alter this to best fit the limitations of your CI environment.

Note

At this time there is no recommendation for a concurrent number established for HPC workloads. It may require experimentation but keep in mind that the runner is not a scheduler. As such it will not take into account the availability of local resources when running jobs.

output_limit

output_limit = 10000

The maximum build log size (in kilobytes) is defined on the runner level. Though this functionality exists in the upstream GitLab runner, teams with larger build/test processes are likely to experience issues with the default settings.

If a user’s output from a CI job exceeds the default limit (4MB) the job will fail and they will see the following error message: Job's log exceeded limit of 4194304 bytes.
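Because output_limit is given in kilobytes while the error message reports bytes, a quick conversion helps when choosing a value. The Python below is a small arithmetic sketch; the function and constant names are hypothetical, and the 4096 KB figure is simply the 4MB default mentioned above expressed in kilobytes.

```python
DEFAULT_OUTPUT_LIMIT_KB = 4096  # the 4MB default, expressed in kilobytes


def output_limit_bytes(output_limit_kb):
    """Convert an output_limit value (kilobytes) to bytes."""
    return output_limit_kb * 1024
```

The default works out to 4096 * 1024 = 4194304 bytes, matching the error message, while the output_limit = 10000 used in the example allows 10240000 bytes of job log.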

Admin Defined Commands

pre_clone_script = '''
    module use /example/modules/Core && module load git
'''

The pre_clone_script is outlined in GitLab’s advanced configuration documentation. It can be used to inject administrator defined commands into a user’s CI job at predefined points. In the above example we are leveraging LMOD to ensure the required version of Git is available.

Note

As specified in the requirements, Git version 2.9+ needs to be available in order to use the enhanced runner. However, this newer version of Git is only technically required during the get_sources phase of a job. By leveraging the above method you can avoid installing a newer system wide version of Git.

The changes to the environment (module use prepends to the MODULEPATH) will only be present during this get_sources phase of the job. Each subsequent phase of the job, including when the user defined scripts are executed, will occur in a clean environment.

In addition to the pre_clone_script, there also exist the pre_build_script and post_build_script options, which work the same way.