Tractor 2.x Features
New features in Tractor 2 extend the core Tractor system established in the 1.x family of releases. The additions and improvements are broad, ranging from productive new command-line and scripting interfaces for wranglers to simple user-interface changes. Internal upgrades range from a new high-performance, high-capability job database to new studio-wide resource sharing and allocation controls. Please refer to the guidelines described in Upgrading.
Here are some highlights:
Tractor Product Layout -- Single Release Directory, Single Download per platform, Bundled Subsystem Updates -- The Tractor 2.x packaging and installation layout includes a matched set of Tractor components all in one download: engine, blade, spooler, user interfaces, and scripting tools. They are all installed together in one versioned area, along with only one copy of matched shared resources including pre-built versions of several third-party subsystems.
Tractor Query Tools -- Introducing tq, the Tractor query command-line tool and modules. Based on proven Pixar studio tools, tq is the best way to query live or historical Tractor data from your terminal shell, from your Python scripts, or from a new tab in the Dashboard.
Adaptive Farm Allocations -- A way to dynamically allocate abstract resources between people or projects using Tractor's flexible Limits system. If two films are in production, 60% of the farm can be allocated to one of them, 30% to the other, leaving the remaining 10% for other projects. If one show is idle the others can temporarily expand their shares, then shrink back to the nominal levels when all projects are active.
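The redistribution idea above can be sketched in a few lines. This is an illustration of the concept only, not Tractor's internal algorithm: when some projects are idle, their nominal shares are temporarily absorbed by the active projects in proportion to each active project's own nominal share.

```python
# Illustrative sketch only (not Tractor's internal Limits logic): idle
# projects' nominal shares are redistributed among the active projects
# in proportion to each active project's nominal share.
def effective_shares(nominal, active):
    idle_pool = sum(s for p, s in nominal.items() if p not in active)
    active_total = sum(nominal[p] for p in active)
    return {p: nominal[p] + idle_pool * nominal[p] / active_total
            for p in active}

# With filmB idle, filmA and the "other" pool temporarily absorb its 30%.
shares = effective_shares({"filmA": 0.6, "filmB": 0.3, "other": 0.1},
                          active={"filmA", "other"})
```

When filmB becomes active again, the same computation naturally shrinks the other shares back to their nominal levels.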
Dispatching Tiers -- A simple way to organize broad sets of jobs into a descending set of site-defined priority groups. The default tiers are named: admin, rush, default, batch. Create your own!
Custom Menu Actions -- Add site-defined Dashboard menu items that can invoke your own centralized scripts, parameterized by the user's current list selection.
Job Authoring API -- A new tractor.api.author module allows your Python scripts to easily create Job, Task, and Command objects linked together according to your dependency requirements. The Job.spool() method then sends the resulting job to the tractor-engine job queue.
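A small dependency graph built with the authoring API might look like the sketch below. The class and method names (Job, Task, addChild, spool) follow the Tractor 2.x documentation; the job titles and argv values are illustrative, and lightweight stand-in classes keep the sketch runnable where the tractor package is not installed.

```python
# Sketch of building a two-task job with tractor.api.author; stand-ins
# are substituted when the tractor package itself is unavailable.
try:
    import tractor.api.author as author
except ImportError:
    class _Stub:
        def __init__(self, **kw):
            self.__dict__.update(kw)
            self.children = []
        def addChild(self, node):
            self.children.append(node)
    class author:  # minimal stand-in namespace for this sketch
        Job = Task = _Stub

job = author.Job(title="ribgen then render", priority=100,
                 service="PixarRender")
render = author.Task(title="render", argv=["prman", "shot.rib"])
ribgen = author.Task(title="ribgen", argv=["ribgen", "shot.scene"])
render.addChild(ribgen)   # ribgen must finish before render starts
job.addChild(render)
# job.spool() would then send the job to the tractor-engine queue;
# it is omitted here because it requires a live engine.
```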
Simple Engine Discovery -- A simple "zero-config" announcement capability for small studios allows tractor-blades and other Tractor tools to find tractor-engine on the local network without requiring manual nameserver (DNS/LDAP) configuration changes. Tractor-engine will automatically disable this SSDP-style traffic at studios where the hostname alias "tractor-engine" has already been created by an administrator in the site nameserver database.
Checkpoint-Resume Support -- Extensions to job scripting, dispatching, and the Dashboard add interesting new capabilities related to incremental computation. Tractor also supports a general "retry from last checkpoint" scheme. Both features integrate with the new RenderMan 19 RIS checkpoint and incremental rendering features.
Blade Auto-Update -- A simple tractor-blade patch management system allows administrators to "pin" the farm to a particular blade patch version, and automatically push out a new version to the entire farm. Out of date blades restart themselves using the new module version.
Pluggable Authentication Module (PAM) support -- The engine's optional new built-in PAM support delegates password validation directly to the operating system on the engine host. This alternative makes it simple to enable password support at studios where the LAN already provides adequate credential transport security.
Privilege Insulation -- The EngineOwner setting in tractor.config specifies the login identity under which tractor-engine should operate. This setting is important because it allows the engine to drop root privileges immediately after it has acquired any protected network ports that it may need. The engine's normal day-to-day operations will then occur under the restrictions associated with the specified login name.
Dynamic Service Key Advertisement -- Several blade profile "Provides" modes have been added to support some advanced service key use cases. For example, blades can dynamically advertise a different set of service key capabilities depending on which keys have already been "consumed" by previously launched commands.
Resource Usage Tracking -- The operating system rusage process metrics CPU, RSS, and VSZ are now recorded into the job database for each launched command. Currently supported on Linux and OSX tractor-blades.
Command Retry History -- A unique tracking record is now created in the job database for every command launch attempt, so the history of retries on a given task can be reviewed using the tq tool, for example.
Configuration File Loading -- A streamlined override system can help to reduce clutter and improve clarity about which files have been modified from their original "factory settings" at your studio.
Task Concurrency Throttle -- Each job can specify a "maxActive" attribute to constrain the number of concurrently running tasks from that job. This quick wrangling control over a job's "footprint" size on the farm can be useful when changing the full site-wide Limits settings is not appropriate.
Automatic Blade Error Throttle -- This blade profile setting will prevent blades from picking up new work if they encounter too many errors within a given time interval.
Job Spooling Improvements -- Job spooling upgrades include faster processing, better error checking, and bundling of required subsystems. A parallelized job intake and database staging scheme can dramatically reduce backlogs when many jobs are spooled simultaneously, or when many "expand" tasks are running in parallel. A self-contained Tcl interpreter bundled with the spooler simplifies site install requirements and can perform client-side error checking prior to job delivery to the engine. A new JSON job spooling format is also supported (not available prior to beta-1, pending changes).
Bug fixes and improvements
Tractor 2.4 is an update focused on overall performance, and on handling large farm sizes, job sizes, and numbers of concurrently connected user sessions. Changes include:
A significant code refactoring effort addressed several internal thread contention bottlenecks, including those related to frequent password checks and identity management. The new logic results in improved throughput, especially on very large farms (5000+ blades) where many dashboard sessions and automated scripts (1000+) are accessing job status.
The number of handler threads for running custom menu item backend scripts now scales up along with other thread pools, based on tractor.config settings.
Fix for a string handling issue that could result in intermittent loss of access to task log output.
Fixed mismatched version strings in some of the system service start up scripts.
Fixed the default MAYA_LOCATION on Mac OSX for renders using the recently introduced "rfm" environment key patterns like "rfm-RRR-maya-MMMM" (as distinct from similar "rms-" keys). The environment key causes tractor-blade to set up various environment variables before it launches each command. These can include values like MAYA_LOCATION and PATH. There are built-in key handlers as well as custom handlers defined by each studio. In this case, a key like "rfm-21.0-maya-2016.5" should have caused shared.macosx.envkeys to extend PATH with
/Applications/Autodesk/maya2016.5/Maya.app/Contents/bin
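A site-written envkey handler for this pattern might extract the Maya version from the key and build the PATH entry shown above. The regex and helper name below are illustrative, not the actual shared.macosx.envkeys implementation; the path template is taken from the example above.

```python
import re

# Hypothetical sketch of mapping an "rfm-RRR-maya-MMMM" environment key
# (e.g. "rfm-21.0-maya-2016.5") to the Maya bin directory on Mac OSX.
_RFM_KEY = re.compile(r"^rfm-([\d.]+)-maya-([\d.]+)$")

def maya_bin_for_key(envkey):
    m = _RFM_KEY.match(envkey)
    if not m:
        return None   # not an rfm key; other handlers would apply
    maya_version = m.group(2)
    return ("/Applications/Autodesk/maya%s/Maya.app/Contents/bin"
            % maya_version)
```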
Fixed a blade status probe problem in some cases where a single blade was running a large number of commands concurrently.
Fixed a tractor-blade issue that caused task launches to fail for jobs submitted by users with the '@' character in their login name. (Not their email address, but their unix login user name.)
Adjusted the log severity level of the 'limit max set to zero?' diagnostic. It now only occurs at TRACE level as a debugging hint. It is acceptable for limits to have a maximum sometimes set to zero.
Added job service key expression support for blade selection based on "total physical RAM". For example the expression
RemoteCmd {prman my.rib} -service "PixarRender && @.totalmem > 24"
selects blades that provide the "PixarRender" service (blade.config) and which have at least 24 gigabytes of RAM installed. The previously supported "@.mem" key for "available free RAM" is also still available.
Enabled access to BladeUse attributes in 'tq blades' queries: taskcount, slotsinuse, and owners.
Added a --user option to the logcleaner utility script so that jobs can be queried as a different user than the process owner that performs the file removal.
Fixed the --add and --remove operations in the tq jattr and cattr commands for making relative changes to job and command attributes that are lists.
Addressed a tractor-engine socket exception handling issue on Linux for cases where a tractor-blade host (operating system) has become unresponsive, such as in cases of GPU driver or OOM issues or a kernel panic. The tractor-engine process would sometimes exhibit high cpu load in these cases, spinning in the socket handler.
Fixed the access-denied advisory text in JSON responses to retry, skip, and job interrupt URL requests.
Suggested workaround for RHEL6 PAM-related file descriptor leak:
On Linux RHEL 6.x era releases, the pam_fprintd.so module contains a bug causing it to leak file descriptors on every call from tractor-engine. Since PAM modules are loaded into the tractor-engine process, and the engine performs many authentications over time, the unclosed "pipe" descriptors will accumulate, unseen by the main tractor-engine code, and will eventually exhaust the available file descriptor limit for that engine process. While many studios do not depend on fingerprint validation, especially for scripted API access to a system service, the "fprint" module is called indirectly from many common RHEL6 PAM policies, including "login" and "su". It has been removed from the common policies in RHEL 7 era distributions. A workaround for RHEL6 is to create your own "tractor" policy that doesn't include system-auth, or perhaps to specify a less general policy in crews.config, such as password-auth.
Features
GPU Exclusion Patterns can now be specified in the blade.config "ProfileDefaults" block. This setting excludes certain GPU types from service matching and counting consideration. Note that this keyword is only valid in the ProfileDefaults dictionary, and GPU filtering is performed prior to any per-profile match testing. Background: a given profile can match specific hosts based on several criteria in the "Hosts" clause; these can include the GPU type and count available on that host. Some hosts contain multiple GPUs, including "uninteresting" virtual or underpowered GPUs that should always be excluded from consideration prior to the profile matching pass. Use "GPUExclusionPatterns" to enumerate the makes/models of GPUs to be skipped in counts and matches. Each item in the list is a simple "glob-style" wildcard pattern, and patterns without '*' or '?' will be treated as if "TEXT" was given. See the new stock blade.config for an example:
"GPUExclusionPatterns": ["QX?", "paravirtual", "RV3*"]
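The matching behavior can be sketched with Python's fnmatch module. Treating a pattern that has no '*' or '?' as a substring-style match is an assumption in this sketch; the engine-side semantics may differ.

```python
from fnmatch import fnmatchcase

# Sketch of glob-style GPU label filtering, using the stock example
# patterns. The substring fallback for wildcard-free patterns is an
# assumption, not confirmed blade behavior.
PATTERNS = ["QX?", "paravirtual", "RV3*"]

def gpu_excluded(label, patterns=PATTERNS):
    return any(
        fnmatchcase(label, p if set(p) & set("*?") else "*%s*" % p)
        for p in patterns
    )
```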
Added a new --allow parameter to "tq" to control custom nimby modes on selected blades. Usage is tq nimby --allow MODE BLADE_SELECTION, for example:
tq nimby --allow santa,yoda name like rack42
In that example, blades running on machines with hostnames containing "rack42" (like "rack42-01" or "rack42-g11") will only accept tasks from jobs owned by either santa or yoda.
Added a new tq blade query modifier called "registered" to select blades that have re-registered with the engine since last having been "cleared" from the blade list display. Now, by default, only registered blades are shown in tq commands, matching the same blades that are visible in the dashboard. Older cleared blades (not re-registered) can still be listed with the tq -a/--archives option.
Tractor-blade now adds its current search path to command output logs when the command exec itself fails because the executable was not found.
A new blade.config setting "SubstituteJobCWD" can mandate that a single working directory should be used for all command launches from all jobs, on blades using a given profile.
Added a new administrative tractor-dbctl "--exec-sql" option for diagnostic and maintenance use; it uses db.config for connection details.
Tractor 2.2 includes significant updates to several internal systems to improve user experience, correctness, and overall performance. There have been a variety of other features and updates as well.
A complete shutdown of the engine and job database will be required for the transition to 2.2. Simply starting the 2.2 engine will cause the old database to be upgraded to the new format automatically. Note that this is a one-way data migration; please back up your data area prior to restarting with 2.2 in order to retain the 2.1-format data, should you need to revert to the older engine. Any tasks left running on the farm when the older engine is shut down will continue to execute, and older tractor-blade processes will re-register with the new engine when it is started. New 2.2 tractor-blade servers can be started later, all at once or on a rolling basis; the blade.config VersionPin mechanism can also be used to automatically upgrade blades as they finish their current tasks.
Features
Dashboard Job Notes -- A new Notes field has been added to the Dashboard job details pane, allowing text annotations to be added to any job. Notes are visible to other users, and the presence of a note is indicated with a small "chat bubble" icon in the job list. These notes can be used to describe a problem to wranglers, or to explain why a job needs, or is getting, special handling. The engine will automatically add a note to a job when an attribute is changed through some user action, such as altering priority, so the notes become a history of changes to the job.
Dashboard Blade Notes -- A new Notes field has been added to the Dashboard blade details pane, allowing text annotations to be attached to a blade entry. These notes can be used by system administrators to describe known issues or to discuss ongoing admin work on a machine.
Dashboard Job Pins -- Individual jobs in each user's job list can now be "pinned" to the top of the list, independent of the global list sorting mode. Jobs might be pinned because they are important to track or just because they represent a current "working set" of jobs. The group of pinned jobs float at the top of the list, and they are sorted according to the overall list sorting mode, within the pinned group.
Dashboard Job Locks -- A single user can now "lock" a job from the Dashboard. A locked job can only be modified by the user who locked it. Locks are typically only used by wranglers who are investigating a problem and who want to prevent other users from changing, restarting, or deleting a job while the investigation is proceeding. The lock owner can unlock the job when done. Permission to apply a lock is controlled by the JobEditAccessPolicies "lock" attribute in crews.config.
Task Logs 'L' Hotkey -- When navigating the tasks within a job, the logs for the currently selected task can be displayed by pressing the 'L' key. The key is a toggle, so pressing 'L' again will close the currently open log.
User-centric Job Shuffle - Individual users can re-order their own jobs on the queue without disrupting global priority settings. The dashboard job list option "Shuffle Job To Top" essentially exchanges the "place in line" of the selected job with a job submitted earlier from the same user, causing the selected job to run sooner than it would in the default submission order. This swap does not affect the ordering of other jobs on the queue, relative to the submission slots already held by that user. This slightly unusual feature is a simplified re-implementation of the old per-user dispatching order controls in Alfred, as requested by several customers. Permission to perform this kind of reordering is controlled by the JobEditAccessPolicies "jshuffle" attribute in crews.config.
The "project" affiliations for each job are now displayed in the job list view.
"Delete Job" action is now called "Archive Job" -- The former "Delete Job" menu item has been changed to "Archive Job" to better reflect its actual function: when the db.config setting "DBArchiving" is enabled, jobs that are removed from the active queue are transferred to an archive database where they can still be inspected and searched in tq queries. If DBArchiving is False, then "deleted" jobs are actually deleted and their database entries are removed -- in this case the dashboard menu item still says "Delete Job".
Archived Jobs View -- A Dashboard view of previously "deleted" (archived) jobs is now available. This view is analogous to a "trash can view" in some file browsers or e-mail clients. Jobs listed in the archive view can be browsed, and can also be restored to the main job queue where they can again be considered for dispatching. Note that jobs can sometimes contain "clean-up" commands that execute when the job finishes. These clean-ups may remove important temporary files, which can make it impossible to re-execute that job.
Task progress bars for Nuke renders -- Tractor-blade now triggers a Dashboard progress bar update when it encounters a multi-frame progress message from Nuke, of the form "Frame 42 (7 of 9)".
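Parsing such a progress line into a percent-done value can be sketched as follows. The exact pattern the blade uses internally is not published; this regex is an assumption based on the example message form above.

```python
import re

# Sketch of extracting percent-done from Nuke's multi-frame progress
# messages, e.g. "Frame 42 (7 of 9)". The regex is an assumption.
_PROGRESS = re.compile(r"Frame\s+(\d+)\s+\((\d+)\s+of\s+(\d+)\)")

def percent_done(line):
    m = _PROGRESS.search(line)
    if not m:
        return None               # not a progress line
    done, total = int(m.group(2)), int(m.group(3))
    return 100.0 * done / total
```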
Task Elapsed Time Bounds -- Job authors can now specify an acceptable elapsed time range for a given launched command. Commands whose elapsed time is outside the acceptable range will be marked as an error. Commands that run past the maximum time boundary will be killed. Example job script syntax:
RemoteCmd {sleep 15} -service PixarRender -minrunsecs 5 -maxrunsecs 20
Per-Tier Scheduling -- A new extension to the DispatchTiers specification in tractor.config allows each defined tier to have its own scheduling mode. For example, the "rush" tier might be scheduled in a strict FIFO order, whereas the default mode might be one of the modes that favor shared access (like P+ATCL+RR). Tiers can be assigned the new "P+CHKPT" mode to take advantage of the partial-graph looping feature in Tractor 2.0; tiers using that mode should be placed before tiers receiving "classic" non-checkpoint jobs.
Site-defined Task Log Filters -- A new FilterSubprocessOutputLine() method is now available as an advanced customization feature in the TractorSiteStatusFilter module. This method provides Python access to every line of task output. The site-written code can perform arbitrary actions in response to task output, and built-in Tractor-specific actions are also available. These include marking the task as an error, generating percent-done progress updates, initiating a task graph "expand" action, and stripping the output line from the logs.
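The kind of per-line logic such a filter might apply can be sketched standalone. The return protocol below (an action tag plus the possibly modified line) is purely illustrative; it is not the actual TractorSiteStatusFilter signature, and the trigger strings are hypothetical.

```python
# Standalone sketch of per-line filtering logic a site-written
# FilterSubprocessOutputLine() override might implement. The (action,
# line) return convention and trigger strings are illustrative only.
def filter_line(line):
    if "FATAL" in line:
        return ("error", line)      # mark the task as an error
    if line.startswith("ALF_PROGRESS"):
        return ("progress", line)   # forward a percent-done update
    if "password" in line.lower():
        return ("strip", None)      # drop the line from the logs
    return ("keep", line)           # pass the line through unchanged
```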
GPU Detection -- On start-up, tractor-blade now makes an attempt to enumerate any GPU devices installed on the blade host. The device model and vendor name "labels" are made available during the profile selection process so that groups of blades can be categorized by the presence or type of GPU, if desired. The "Hosts" dictionary in a blade.config profile definition defines the matching criteria for that profile. Two new optional keys are now available: the "MinNGPU" entry specifies the minimum number of GPU devices required for a match, and "GPU.label" specifies a wildcard-style matching string for a particular vendor/model. This label string also now appears in the Dashboard blade list, if a GPU device is found.
The new tractor.config setting "CmdAutoRetryStopCodes" specifies a list of exit codes that will be considered "terminal" -- automatic retries will NOT be considered for commands that exit with these codes, unless the -retryrc list for a specific command requests it. Negative numbers represent unix signal values, and the codes 10110 and 10111 are generated when a command's elapsed time falls outside the new run-time bounds options, when given. The default setting for the no-retry stop codes lists the values for SIGTERM, SIGKILL, and the two time-bounds codes:
"CmdAutoRetryStopCodes": [-9, -15, 10110, 10111],
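The stop-code veto described above amounts to the following decision, sketched here in isolation (the engine's full retry logic involves other conditions not shown):

```python
# Sketch of the stop-code veto only: an exit code in the stop list
# suppresses automatic retry consideration unless the command's own
# -retryrc list explicitly requests a retry for that code.
STOP_CODES = [-9, -15, 10110, 10111]   # SIGKILL, SIGTERM, time bounds

def may_auto_retry(exit_code, retryrc=(), stop_codes=STOP_CODES):
    if exit_code in retryrc:
        return True                    # per-command override wins
    return exit_code not in stop_codes
```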
Engine statistics query -- A new URL request (Tractor/monitor?q=statistics) has been added to help integrate tractor-engine performance metrics with other site-wide monitoring systems. The returned JSON object contains the most recent sample of several statistics that the engine collects about itself. This data might be used, for example, to populate an external site monitoring system. Some monitoring systems are able to make this URL request for data directly, while others may require a small data source script to be written that requests the JSON statistics report and then forwards each value of interest to the monitoring system separately.
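A data-source script of that kind might fetch the report and flatten it into metric name/value pairs for a monitoring backend. The engine host below and the shape of the JSON reply are assumptions; only the URL path comes from the feature description above.

```python
import json
from urllib.request import urlopen

# Sketch of a monitoring data-source script. The host name is a
# hypothetical example; the URL path is the one documented above.
def fetch_engine_stats(host="tractor-engine", port=80):
    url = "http://%s:%d/Tractor/monitor?q=statistics" % (host, port)
    with urlopen(url, timeout=5) as reply:
        return json.load(reply)

def flatten(stats, prefix=""):
    # Turn nested JSON into dotted "a.b = value" metric pairs.
    for key, val in stats.items():
        name = key if not prefix else "%s.%s" % (prefix, key)
        if isinstance(val, dict):
            yield from flatten(val, name)
        else:
            yield name, val
```

Each (name, value) pair from flatten() would then be forwarded to the monitoring system with whatever submission call it provides.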
TR_EXIT_STATUS auto-terminate policy change -- the default behavior for the TR_EXIT_STATUS handler has now reverted to the 1.x and earlier 2.x behavior in which the status value is simply recorded and then reported later when the command actually exits. The more recent behavior in which the blade actively kills the app upon receipt of TR_EXIT_STATUS is still available, but it must be explicitly enabled in blade.config using the profile setting:
"TR_EXIT_STATUS_terminate": 1,
Blade record visibility flag -- The Dashboard blade list display is created from database records describing each tractor-blade instance that has connected to the engine in the past. These records are retained, even when a blade host is no longer deployed, in order to correlate previously executed commands with the machine they ran on. The dashboard blade list menu item "Clear prior blade data" no longer removes the actual database record for the given blade. Instead it simply sets a flag that hides the record from display in the dashboard. The record (and its new unique id field) are now retained for correlation with old task records. The blade data items can be completely removed manually if they are truly unneeded.
Cookie-based Dashboard relogin -- A new policy allows auto-relogin to new Dashboard windows based on a saved session cookie, even when site passwords are enabled. The cookie contains only a session ID that is validated by the engine, it does not contain any password data itself. The older policy that denied auto-login when passwords are required can be restored by adding a "_nocookie" modifier to the crews.config SitePasswordValidator setting.
Added a new tractor-dbctl --set-job-counter option that sets the initial job ID value in a new job database. Job IDs start at 1 by default, so the ability to specify a different starting value can be helpful when starting from a fresh Tractor install, in order to prevent overlaps between job IDs from the new install and older jobs. Tractor upgrade installs that reuse the prior job database will continue to see job ID continuity.
Several internal improvements have been made to the job database upgrade procedure. Many code-related changes in new releases can now be applied without a significant database alteration, needing only an engine restart. Changes involving new database schema definitions are now applied with a system that better handles upgrades across multiple versions.
Overall throughput optimizations -- Various performance improvements have been made in this release, especially with regard to handling large numbers of simultaneous updates as many jobs complete or are deleted at the same time.