Configurations

This page documents all aspects of the Jacamar CI configuration, along with key elements of the GitLab Runner configuration that relate either to the custom executor or to general concerns. In both cases the configurations are managed via the TOML format.

Jacamar CI Config

Due to the highly configurable nature of Jacamar CI, it requires its own configuration file. If you are new to this application, we recommend that you first review the admin tutorial.

Unless explicitly noted there are no default string/integer values and unset booleans will be false.

`[general]` - Table

Key	Description
`executor`	A required setting that specifies which of the supported executors CI jobs will utilize.
`data_dir`	A required setting where all files/directories for a job are stored. Strict ownership (`user:user`) and permissions (`0700`) are enforced on top level directories.
`retain_logs`	Keep all files generated by the executor and/or scheduling mechanisms (default: `false`, removed upon job completion).
`custom_build_dir`	Observe the directory specified by a user’s `CUSTOM_CI_BUILDS_DIR` variable; this will replace any builds directory derived from the `data_dir`. Jacamar will ensure unique paths and appropriate permissions (`0700`). Does not function in conjunction with `root_dir_creation`.
`name`	Administrator defined string, (currently) only appears in Jacamar’s system logging capabilities to help distinguish from other instances.
`kill_timeout`	Maximum time (duration string, default: `120s`) the Jacamar-Auth application will wait before sending a `SIGKILL` to the underlying Jacamar process if a `SIGTERM` is captured from a custom executor termination.
`shell_path`	Path to the shell used when constructing the Bash shell for job script execution; when not set, it is resolved based upon `PATH`.
`job_message`	Custom message that will be conveyed at the start of every `prepare_exec` stage to the user.
`gitlab_server`	Trusted URL for GitLab server used in all web interactions, takes priority over any values identified in the job response.
`tls-ca-file`	File containing the required certificates for HTTPS actions.
`unrestricted_cmd_line`	Allow unfettered usage of tokens via the command line by all runner generated job scripts.
`static_builds_dir`	Create a static folder (`<data_dir>/user/static/<job_id>`) that will be unique for every job and only removed based upon the `static_min_days` configuration.
`static_min_days`	Minimum number of days any static directory can remain (default: 7, with 0 indicating no cleanup required). This will only be enforced during job cleanup and may lead to longer than average job durations.
`group_permissions`	Set base permissions on Jacamar generated data directories to allow read and execute access for groups (i.e., `0750` permissions).
`jwt_env_variable`	Environment variable to be checked for an `id_token` (default: `CI_JOB_JWT`).
`set_stack_size`	Overrides behavior found in RHEL 8 that reverts the stack size to 8m when capabilities are used. This will cause the user’s environment to set the ulimit to match the available hard limit, normally configured through the runner’s systemd file.
`user_bash_env`	Can be used to define `key=value` pairs that will be injected into the user’s shell responsible for job script and monitoring command execution.

The [general] configuration is applicable to a range of features and can affect both the jacamar-auth and jacamar applications.

[general]
  executor = "shell"
  data_dir = "/ecp"
  retain_logs = true
  custom_build_dir = false
  name = "My Jacamar Driver"
  kill_timeout = "120s"
  job_message = """
  ****************************************************************************
                        NOTICE TO USERS

  This is an example message ....
  ****************************************************************************
  """
  gitlab_server = "https://gitlab.example.com"
  tls-ca-file = "/example/file.crt"
  unrestricted_cmd_line = false
  static_builds_dir = false
  static_min_days = 7
  group_permissions = false
  jwt_env_variable = "SITE_ID_TOKEN"
  user_bash_env = ["EXAMPLE=var"]

`data_dir` - Required

[general]
  data_dir = "/ecp"

The data_dir is used as the base directory where all required build, cache, and script related contents are stored. Unlike a traditional runner builds/cache directory, the data_dir seeks to enforce user ownership over the files by establishing a top level directory with 700 permissions by default, although this can be configured to use 750 permissions if group access is desired. Additionally, if set to data_dir = "$HOME", jobs will be stored in the user’s home directory.

It should be noted that upon changing the group_permissions field in the configuration, as well as the accompanying feature flag, the data_dir will be deleted and recreated as a completely new directory with the appropriate permissions.

The easiest way to understand the data_dir setting is to examine its effect. In our example /ecp is the base directory, and immediately below it is the individual folder of the local user responsible for triggering the job:

$ namei -l $(pwd)
f: /ecp/user/builds/runnerShort/000/group/project
    drwxr-xr-x root    root     /
    drwx-----x root    root     ecp
    drwx------ user    user     user
    drwx------ user    user     builds
    drwxr-xr-x user    user     runnerShort
    drwxr-xr-x user    user     000
    drwxr-xr-x user    user     group
    drwxr-xr-x user    user     project

Jacamar is responsible for generating the /user/{builds/cache/script} directories. It is important to note that this is the responsibility of jacamar not jacamar-auth. The authorization process will identify all required directory paths but it becomes the responsibility of the user owned process to realize their creation. Strict rules regarding ownership and permissions are enforced for this directory creation process.

Important

It is not required to allow the user to generate their own base directory (/ecp/user in our example structure); in fact, we understand it is often desirable to have an administrative process create these automatically. Just ensure that the folder has proper ownership (user:user) and permissions (0700), or else the job will fail.

Note

When choosing a data_dir for an HPC scheduler executor type, verify that the volume is mounted in the same way on the runner host system as it is on any of the available compute resources (source issue#155).
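
The ownership and permission requirements described above can be checked before any job runs. The following is a minimal sketch, assuming an administrative process has already created the user’s base directory; the helper name `base_dir_problems` is hypothetical and not part of Jacamar CI, and the checks only mirror the rules stated in this section (user:user ownership, 0700 permissions).

```python
import os
import stat


def base_dir_problems(path, expected_uid, expected_gid):
    """Return a list of reasons the directory would not satisfy user:user / 0700.

    Hypothetical helper for illustration only; Jacamar CI performs its own
    enforcement and may apply additional rules.
    """
    info = os.stat(path)
    problems = []
    if not stat.S_ISDIR(info.st_mode):
        problems.append("not a directory")
    if info.st_uid != expected_uid or info.st_gid != expected_gid:
        problems.append(
            f"ownership is {info.st_uid}:{info.st_gid}, "
            f"expected {expected_uid}:{expected_gid}"
        )
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if mode != 0o700:
        problems.append(f"permissions are {mode:04o}, expected 0700")
    return problems
```

For the example layout, such a check would be pointed at /ecp/user with the UID and GID of the target account (for instance, as returned by `pwd.getpwnam`); an empty list means the directory meets the stated requirements.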

`limit_build_dir`

An issue with the traditional data_dir can be observed in the structure of the builds_dir, specifically the inclusion of the runner short token in the generated path. When managing a small pool of runners that share the same data_dir this doesn’t present a major issue. However, the multiplying effect these folders have can quickly become apparent as you scale up the number of runners across machines/clusters.

To help address this the limit_build_dir has been introduced to offer a solution that avoids runner specific folders while still meeting the requirement that each CI job executes in its own unique directory. It is best to highlight this in action:

[general]
  data_dir = "/store/$USER/ci"
  limit_build_dir = true

Once enabled our process will utilize the data_dir and observe all rules surrounding permissions but instead of constructing a standardized path it will claim the next available concurrent directory using fcntl.

The resulting build directory will be created using the project name coupled with the ID and within that structure concurrent folders are managed:

$ pwd
/store/username/ci/builds/project-name_uniqueID

$ ls -a
.000.lock  .001.lock  .002.lock 000  001  002

The Jacamar application will utilize and observe these lock files during initial configuration in order to claim a concurrent ID regardless of the runner.

$ cat .004.lock | jq
{
  "job_id": "2424067",
  "expiration": "1714066690",
  "hostname": "example"
}
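
To make the claiming behavior more concrete, here is a minimal sketch of how a process might claim the next available concurrent directory using `fcntl` locks and the lock file fields shown above. It is not Jacamar’s implementation: the function name `claim_concurrent_dir`, the `ttl_seconds` and `max_slots` parameters, and the rule that an unexpired lock file is skipped are assumptions made for illustration.

```python
import fcntl
import json
import os
import socket
import time


def claim_concurrent_dir(project_dir, job_id, ttl_seconds, max_slots=100):
    """Claim the lowest free slot (e.g. "000") and return its name.

    Hypothetical sketch: a slot is skipped when another process holds its
    lock, or when the recorded expiration has not yet passed.
    """
    os.makedirs(project_dir, exist_ok=True)
    now = int(time.time())
    for index in range(max_slots):
        slot = f"{index:03d}"
        lock_path = os.path.join(project_dir, f".{slot}.lock")
        fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
        with os.fdopen(fd, "r+") as handle:
            try:
                # Non-blocking exclusive lock; fails if another process holds it.
                fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue
            raw = handle.read()
            try:
                expiration = int(json.loads(raw)["expiration"]) if raw.strip() else 0
            except (ValueError, KeyError, TypeError):
                expiration = 0  # unreadable record is treated as free
            if expiration > now:
                continue  # claimed earlier and not yet expired
            # Record the claim using the same fields as the example lock file.
            handle.seek(0)
            handle.truncate()
            json.dump(
                {
                    "job_id": str(job_id),
                    "expiration": str(now + ttl_seconds),
                    "hostname": socket.gethostname(),
                },
                handle,
            )
            handle.flush()
        # Leaving the "with" block closes the file, which releases the lock;
        # the expiration written above keeps the slot reserved.
        os.makedirs(os.path.join(project_dir, slot), exist_ok=True)
        return slot
    raise RuntimeError("no concurrent directory available")
```

Two consecutive calls against the same project directory would yield `000` and then `001`, since the first slot’s expiration has not passed.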

To the average user this won’t present any changes to their workflows; however, it does greatly alter the structure of the <data_dir>/builds directory. We strongly advise that if you utilize this feature you remove all existing build directories and start from scratch.

Key	Description
`limit_build_dir`	Enforces a limited structure on the builds_dir by creating a user driven process to automatically claim concurrent directories through file locking.
`max_build_dir`	Indicates how many concurrent build directories can be left on the system (default: 0, only limited by cumulative runner concurrency).
`uncap_build_dir_cleanup`	By default cleanup is limited to a single builds_dir in every job. This limits the risk of a CI job becoming “stuck” during clean_exec, during which we lack the ability to directly notify the user of any cleanup actions.
`file_lock_debug`	Create a log file that outlines all actions of the `jacamar lock` process occurring in userspace. This should only be used for troubleshooting potential errors with the process of generating/claiming file locks as there is no automated cleanup on these files.
`user_enabled_limit`	Only a user (via the `JACAMAR_LIMITED_DIR: 1` variable) can utilize this feature. The ideal workflow would have this enabled in conjunction with the primary `limit_build_dir` to allow select users to test this feature at scale with existing infrastructure.

[general]
  limit_build_dir = true
  max_build_dir = 0
  uncap_build_dir_cleanup = false
  file_lock_debug = false
  user_enabled_limit = false

When deploying and testing this feature for the first time it may prove beneficial to enable the file_lock_debug option. This results in a folder (lock_debug/) appearing alongside the concurrent directories and lock files. Within it, each job will have a file with details regarding the file locking process.

$ cat 2424067.json
{"level":"info","msg":"unable to lock file 0: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:49Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.001.lock has not expired","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"unable to lock file 2: fcntl syscall error: resource temporarily unavailable","time":"2024-04-25T16:07:50Z"}
{"level":"info","msg":"lock file /ci/username/builds/project-name_uniqueID/.003.lock has not expired","time":"2024-04-25T16:07:57Z"}
{"level":"info","msg":"file claimed with 1714066690 expiration on ci-test-2 host","time":"2024-04-25T16:08:10Z"}
{"level":"info","msg":"identified concurrent target: 004","time":"2024-04-25T16:08:15Z"}

Be aware that this should only be used for testing/debugging purposes. It can also be useful to couple it with the user_enabled_limit option, which restricts the use of the feature to projects that explicitly opt in. This provides a way to experiment at your scale without having to manage an additional set of deployments.

`unrestricted_cmd_line`

Important

We strongly advise only enabling this option when you know /proc has been mounted with hidepid, or else you will increase the risk of runner generated scripts exposing CI_JOB_TOKEN via the command line.

[general]
  unrestricted_cmd_line = true

By default jacamar-auth takes steps to avoid cases where a job token could end up in runner defined scripts (e.g., when using Git or managing artifacts). This includes augmenting the runner generated scripts and leveraging GIT_ASKPASS. Coupled with the Git credentials script, we also have to restrict the use of the credential store to avoid breakage caused by incorrectly storing the CI_JOB_TOKEN.

All of this is only required if there is a chance that a script the user cannot control could expose their job token in /proc. By default we do not plan to modify this behavior; however, for those that have decided to hide PID listings you can enable this setting.

One final note: this does not protect user generated scripts/actions, and users should always follow the best practices for your machine when working on multi-tenant resources.

Signal Management

The GitLab custom executor allows for configurable durations on timeouts; most importantly, the graceful_kill_timeout defaults to 10 minutes. This means that once a job is canceled jacamar-auth will have this time to gracefully terminate whatever processes it is currently running. However, due to the range of potential configurations relating to downscoping, jacamar-auth enforces its own separate timeout on the jacamar sub-process:

[general]
  kill_timeout = "120s"

The kill_timeout will start once jacamar-auth has intercepted a terminating signal the runner has generated and in turn passed it on to jacamar. Only once this timeout has elapsed will SIGKILL be sent.

Important

Never set your runner’s graceful_kill_timeout configuration below that of Jacamar CI’s kill_timeout. In cases where the runner user downscopes permissions the jacamar-auth application takes special steps to ensure that the appropriate signal reaches the sub-process as permissions will likely prohibit a simple kill(2).

As a backup, any Jacamar CI application will also establish a self-imposed timeout of the job’s maximum duration plus 10 minutes.
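
The relationship between the two timeouts can be verified with a small pre-deployment check. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the runner’s graceful_kill_timeout is taken to be in seconds (600 corresponding to the 10 minute default noted above), and only simple duration strings with a single `s`, `m`, or `h` suffix are handled, which may be narrower than what Jacamar CI accepts.

```python
import re

# Seconds per supported unit; an assumption covering only simple forms
# such as "120s" or "15m".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def duration_seconds(value):
    """Convert a simple duration string (e.g. "120s") to whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration string: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def timeouts_consistent(kill_timeout, graceful_kill_timeout):
    """True when the runner's graceful_kill_timeout (seconds) is not below
    Jacamar CI's kill_timeout (duration string)."""
    return graceful_kill_timeout >= duration_seconds(kill_timeout)
```

With the values used on this page, `timeouts_consistent("120s", 600)` holds, whereas a kill_timeout of "15m" paired with the same runner setting would not.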

`[auth]` - Table

Key	Description
`downscope`	Target downscoping mechanisms for execution of all CI scripts and generated commands through the auth mechanisms. When using `jacamar-auth` this is required.
`jacamar_path`	The full path to the Jacamar application, used in constructing the command for job execution. This can be used if it has been installed outside the user’s `PATH`.
`max_env_chars`	The maximum number of characters that can be defined per environment variable (default: `10000`).
`lists_pre_validation`	Boolean indicates if the allow/block list rules should be observed prior to the execution of the RunAS validate script.
`root_dir_creation`	Indicate via boolean if the privileged Jacamar-Auth user should create the target CI user’s base `data_dir` (e.g., `/data_dir/username`) and assign permission via `chown`.
`user_allowlist`	An authoritative list of users who can execute CI jobs.
`user_blocklist`	A list of usernames that are not allowed to run CI jobs. More authoritative than group lists, but can be overridden by `user_allowlist`.
`groups_allowlist`	A list of groups that are allowed to run CI jobs. Least authoritative.
`groups_blocklist`	A list of groups that are not allowed to run CI jobs.
`shell_allowlist`	If defined, an authoritative list of acceptable shells for CI users, as they are found in the user database.
`pipeline_source_allowlist`	If defined, an authoritative list of acceptable CI_PIPELINE_SOURCES that can result in local jobs. Value obtained through verified GitLab JWT.
`jwt_exp_delay`	Configurable duration string delay allowed in a JWT’s expiration in select cases to allow for automated cleanup actions (default `15m` and maximum `1hr`).
`jwt_required_aud`	Required audience (aud) when validating a JWT.
`allow_bot_accounts`	GitLab managed project bot accounts (i.e., project_{number}_bot) are disallowed by default.
`no_new_privs`	Enforces PR_SET_NO_NEW_PRIVS, to limit the sub-process from gaining additional privileges. Please note that this setting is redundant if seccomp is being used.
`run_stage_allowlist`	List of Run stages that are allowed; all others are skipped with a warning to the user.
`enforce_nologin`	Indicates that jobs should be blocked during configuration if a pam_nologin file (`/etc/nologin` or `/var/run/nologin`; see https://man7.org/linux/man-pages/man8/pam_nologin.8.html) is encountered. The contents of this file will be presented to the user in their job log.
`skip_exit_code_file`	Prevent `jacamar-auth` from attempting to retrieve the real exit_code file from userspace. Reverts error codes back to the generic build failure.

The [auth] table configures the authorization process for approving GitLab and local accounts. It is observed by, and made available to, the jacamar-auth application only. For more details see the Authorization via Jacamar-Auth documentation.

[auth]
  downscope = "setuid"
  jacamar_path = "/custom/bin"
  max_env_chars = 10000
  lists_pre_validation = false
  root_dir_creation = true
  allow_bot_accounts = false
  jwt_exp_delay = "5m"
  no_new_privs = false
  enforce_nologin = true
  skip_exit_code_file = false

  user_allowlist = ["usr1"]
  user_blocklist = ["usr2", "usr3"]
  groups_allowlist = ["grp1", "grp2"]
  groups_blocklist = ["grp3"]
  shell_allowlist = ["/bin/bash"]
  pipeline_source_allowlist = ["push", "web"]
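
The precedence among the four user/group lists can be easier to follow as code. The sketch below encodes only the ordering stated in the table (user_allowlist, then user_blocklist, then the group lists, with groups_allowlist least authoritative). The function name is hypothetical, and the final fall-through rule (deny when an allowlist is configured but nothing matched, allow otherwise) is an assumption rather than documented behavior.

```python
def is_authorized(
    user,
    groups,
    user_allowlist=(),
    user_blocklist=(),
    groups_allowlist=(),
    groups_blocklist=(),
):
    """Illustrative precedence check; not the jacamar-auth implementation."""
    if user in user_allowlist:
        return True  # authoritative, overrides every other list
    if user in user_blocklist:
        return False  # more authoritative than the group lists
    if any(group in groups_blocklist for group in groups):
        return False
    if any(group in groups_allowlist for group in groups):
        return True  # least authoritative
    # Assumption: when any allowlist is configured, unmatched users are denied.
    return not (user_allowlist or groups_allowlist)
```

Using the example lists above, `usr1` would be allowed even as a member of `grp3`, while a user in both `grp1` and `grp3` would be denied.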

`[auth.runas]` - Table

Key	Description
`validation_script`	Specify the path to a script where the local user and target service account can be validated. When using RunAs a script is required.
`user_variable`	Indicates the name of the CI variable a user can define to indicate their target service account.
`sha256`	Checksum of script, if provided will be verified shortly before execution.
`validation_env`	Manages a list of “key=value” strings that dictate additional context to the validation script. These will take lowest priority so avoid using the key for any existing RunAs or system environment variables.

Configuration of the RunAs portion of the authorization flow can offer administrative control over a transition between the CI user and a local account not known by GitLab. For additional details and workflow consideration see the RunAs authorization.

[auth.runas]
  validation_script = "/custom/run-validate.py"
  user_variable = "TARGET_SERVICE_USER"
  sha256 = "e258d248fda94c63753607f7c4494ee0fcbe92f1a76bfdac795c9d84101eb317"
  validation_env = ["DEBUG_ENV=1"]
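
The value for the sha256 key is the SHA-256 digest of the validation script in hexadecimal. As a minimal sketch, it could be produced with Python’s standard `hashlib` module; the helper name `file_sha256` is hypothetical, and any equivalent checksum tool gives the same result.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned 64 character string is what would be placed in the sha256 field; it needs to be regenerated whenever the script changes.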

`[auth.logging]` - Table

Key	Description
`enabled`	If the system logging for `jacamar-auth` should be used for all CI jobs that are processed.
`location`	Identifies where logs will be saved, this can be a distinct file or `syslog` (default). In the case of syslog, a connection to the log daemon will be established, targeting the local syslog server if related values are not specified.
`level`	Denotes the logging level (`error`, `warn`, `info`, or `debug`) of messages saved. Defaults to `debug`.
`network`	Used for dialing remote log daemon connections only (e.g., `tcp`).
`address`	Used for dialing remote log daemon connections only (e.g., `localhost:1234`).

Logging represents configuration of how the jacamar-auth application (ONLY) will log relevant job level information. This occurs in addition to any logging performed by the GitLab runner and assumes that the user account responsible for launching jacamar-auth is provided with the necessary access to the local system log daemon or target file.

[auth.logging]
  enabled = true
  location = "syslog"
  level = "debug"

Note

Incorrectly configured logging will result in CI job failures during the initial configuration stage. Please be sure to test/verify any related configuration changes prior to deployment.

`[auth.seccomp]` - Table

Key	Description
`disabled`	Signal if system call filtering via libseccomp should be disabled, this includes all system defined defaults as well as administrative configurations. We advise only disabling if troubleshooting or under specific circumstances where security requirements are not as high.
`block_calls`	A list of blocked system calls that the `jacamar-auth` application will declare. Incorrectly defined calls will result in an error message being produced immediately upon job creation and should be resolved prior to deployment of any configuration changes.
`block_all`	Globally blocks all system calls from being used; this requires reliance on a manually defined list of `allow_calls` for functionality.
`allow_calls`	List of system calls that will be allowed, this takes precedence over any manually (`block_calls`) or system defined blocked calls.
`log_allowed_actions`	Sets the default action for allowed system calls to log (audit) while still allowing their execution. This option creates a substantial number of logs and is only suited for dev/test environments.
`disable_no_new_privs`	Disables or prevents the application of PR_SET_NO_NEW_PRIVS based upon the usage of seccomp filters. This only applies when seccomp is enabled.
`error_num_block_actions`	Modifies the desired block actions and will return an error code rather than terminating the associated thread.
`validation_plugin`	Path to a Go plugin where the filter can be modified. Setting this value implies that plugin support should be enabled.

The jacamar-auth application by default supports system call filtering through the libseccomp API. This added functionality can be found in versions 0.5.0+ of Jacamar CI. There are two distinct mechanisms by which specific syscalls are identified for filtering: administratively defined configurations and Default Filters established based upon supported downscoping mechanisms.

Note

Due to the nature of Jacamar CI’s architecture, not all potential issues that are present in interactive applications are found here. However, if you have concerns or recommendations, we encourage you to create a security issue for the Jacamar CI project.

[auth.seccomp]
  disabled = false
  block_calls = ["sethostname", "sendfile"]
  log_allowed_actions = false
  disable_no_new_privs = false

Block All By Default

Note

We do not currently have documented support for the known list of syscalls that must be allowed to support basic application functionality. Please use this for testing purposes only at this time.

The block_all option establishes a default filtering mechanism that blocks all syscalls regardless of potential conditions. This optional configuration will necessitate an administrator providing a list of allowable calls, otherwise every job will fail.

[auth.seccomp]
  block_all = true
  allow_calls = ["read", "write", "..."]

Default Filters

The application will attempt to define a meaningful yet limited set of default filters for select syscalls.

Note

Modifications planned for v0.12.0+ removed the remaining default filter, thus disabling seccomp by default for many deployments (see MR 351). For deployments utilizing a standard workflow from systemd/service this will be the equivalent of disabling seccomp with your current deployed version.

Optional Filters

The following filters can optionally be enabled in the [auth.seccomp] table.

Configuration	Filter Description
`limit_setuid`	Block any setuid or setgid call to the non-authorized UID/GID.
`tty_rules`	Block ioctl in conjunction with `TIOCSTI`.

[auth.seccomp]
  limit_setuid = true
  tty_rules = true

`[batch]` - Table

Key	Description
`arguments_variable`	An array of potential CI variables for user provided arguments in the job submission that are checked in order (default `SCHEDULER_PARAMETERS` is always present as a catch all).
`command_delay`	Meter interactions with schedulers via a duration string (default: `30s`). We recommend leaving this at its default value unless specific concerns with your environment arise.
`nfs_timeout`	Largest possible delay to expect from NFS servers as a duration string (default: `30s`). Due to the batch executors’ reliance on compute resources coupled with a network file system, providing too low a value can lead to job results not being correctly conveyed to the user.
`scheduler_bin`	Path to be observed as a prefix for all scheduler commands generated. Useful when default scheduler application on a user’s `PATH` can be incorrect.
`env_vars`	Array of key=value strings that are used when building job submission command (e.g., `qsub`).
`allow_illegal_args`	Do not cause job failures when conflicting (illegal) arguments are encountered.
`skip_cobalt_log`	Identify that the job status found in the CobaltLog should be skipped in favor of an echo in the output file (Ideally for test/debug purposes only).
`lsf_job_cancellation`	Enables the use of `bkill` to signal a running job it’s time to stop based upon a runner generated SIGTERM.
`default_args`	List of arguments that will be injected into the job submission commands.
`disable_name_prefix`	Prevents a user defined name prefix (via `SCHEDULER_JOB_PREFIX`).

These configurations relate exclusively to the supported batch scheduling systems (Cobalt, Flux, LSF, PBS, and Slurm). They will only be observed when a related executor is configured.

[batch]
  arguments_variable = [
      "NEW_SITE_PARAMETERS", "OLD_SITE_PARAMETERS"
  ]
  command_delay = "30s"
  nfs_timeout = "1m"
  scheduler_bin = "/usr/scheduler/bin"
  env_vars = [
      "GPU_ENABLED=true", "EXAMPLE_MODE=debug"
  ]
  allow_illegal_args = false
  lsf_job_cancellation = false
  default_args = ["--clusters=example"]
  disable_name_prefix = false

Feature Flags

Important

These optional configurations are primarily meant for testing/feedback and are subject to modification.

Table	Key	Description
`[general]`	`ff_custom_data_dir`	Allow users to specify their own `data_dir` via CI variables, supersedes `custom_build_dir`.
`[batch]`	`ff_user_args`	Improve shell quoting and reliability when generating job submission commands.

GitLab Runner Config

GitLab has organized documentation covering a number of topics relating to configuring a GitLab runner that we highly recommend you review. Details provided here are focused on aspects that relate directly to Jacamar or other HPC focused administration concerns.

# global
concurrent = 5

# runner specific
[[runners]]
  ...
  pre_clone_script = '''
    ml use /example/modules/Core && ml git
  '''
  output_limit = 10000
  executor = "custom"
  [runners.custom]
    config_exec = "/opt/jacamar/bin/jacamar-auth"
    config_args = ["config", "--configuration", "/jacamar.toml"]
    ...
    graceful_kill_timeout = 600

Beyond correctly configuring the custom executor there are other aspects of the runner’s config that are worth closer examination.

`concurrent`

concurrent = 5

A single config.toml can define multiple runners that can be registered with GitLab. Each appears under a separate [[runners]] table. Regardless of the number of registered runners, there is an upper limit, defined by concurrent, on the total number of jobs that can be running at any given time.

In our above example the limit is set to 5; however, as an admin you can alter this to best fit the limitations of your CI environment.

Note

At this time there is no recommendation for a concurrent number established for HPC workloads. It may require experimentation but keep in mind that the runner is not a scheduler. As such it will not take into account the availability of local resources when running jobs.

`output_limit`

output_limit = 10000

The maximum build log size (in kilobytes) is defined on the runner level. Though this functionality exists in the upstream GitLab runner, teams with larger build/test processes are likely to experience issues with the default settings.

If a user’s output from a CI job exceeds the default limit (4MB) the job will fail and they will see the following error message: Job's log exceeded limit of 4194304 bytes.
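
Because output_limit is expressed in kilobytes while the error message reports bytes, a quick conversion helps when choosing a value. The following is a small illustrative sketch; the function name is hypothetical and it assumes 1 kilobyte is 1024 bytes, which is consistent with the 4194304 byte figure in the error message above.

```python
BYTES_PER_KILOBYTE = 1024  # assumption consistent with 4096 KB == 4194304 bytes


def output_limit_bytes(output_limit_kb):
    """Convert an output_limit value (kilobytes) to bytes."""
    return output_limit_kb * BYTES_PER_KILOBYTE


default_bytes = output_limit_bytes(4096)    # the default 4MB limit
example_bytes = output_limit_bytes(10000)   # the example value on this page
```

Under that assumption, the example setting of 10000 allows roughly 2.4 times as much log output as the default.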

Admin Defined Commands

pre_clone_script = '''
    module use /example/modules/Core && module load git
'''

The pre_clone_script is outlined in GitLab’s advanced configuration documentation. In short, it can be used to inject administrator defined commands into a user’s CI job at predefined points. In the above example we are leveraging LMOD to ensure the required version of Git is available.

Note

As specified in the requirements Git version 2.9+ needs to be available in order to use the enhanced runner. However, this newer version of Git is only technically required during the get_sources phase of a job. By leveraging the above method you can avoid installing a newer system wide version of Git.

The changes to the environment (module use prepends to the MODULEPATH) will only be present during this get_sources phase of the job. Each subsequent phase of the job, including when the user defined scripts are executed, will occur in a clean environment.

In addition to the pre_clone_script, there also exist the options pre_build_script and post_build_script, which work the same way.
