Upgrading to 2.2

  • NOTE: Upgrading to Tractor 2.2 is "permanent" in the sense that you cannot revert to an older tractor-engine while also retaining your old jobs once the 2.2 job database upgrade has been performed. If you BACKUP your current job database before installing 2.2, then it is possible to revert to the older engine version along with jobs restored to their state at the time of the backup. Please refer to the guidelines described in Upgrading.
  • Upgrading to 2.2
  • Upgrading to 2.1
  • Upgrading from 1.x

Changes in 2.2 1614082

  • Added better user attributions in log messages related to job deletes and task actions such as skip, kill, recall, and retry.
  • Improved tractor-engine self recovery from unexpected command signature changes that could result in dispatching stalls in some jobs, and an "assigner Cmd not Ready?" warning in the logs.
  • Fixed an engine problem that would sometimes update task records too early on active tasks that were in the midst of being swept from active blades during a manual retry of a predecessor task. The failed update could lead to an incorrect report of the number of active slots on the given blade.
  • Fixed enforcement of job attribute edit policies such that dashboard edits cannot be applied to a job that is being moved to the archives (aka deleted).
  • Fixed a tractor-blade state checkpoint problem that could sometimes cause state reporting delays when a blade was rebooted.
  • Fixed an engine crash related to handling a deliberately malformed URL query.
  • Added additional exception handler protection in tractor-blade to guard against errors in user-provided custom TractorSiteStatusFilter extensions.
  • Fixed a problem handling very long limit tag names.
  • Improved the dashboard efficiency related to automatic refreshes of job and blade lists, reducing load on the engine and database when many users are connected.
  • Fixed an RPM specification issue that could result in RPM install error messages like "Transaction check error: file /opt from install ..."
  • Fixed tractor init.d scripts to return a non-zero status code (3) when the "status" query is used to check whether the service is running or not.
  • Worked around an init.d built-in function issue that could sometimes result in a "dirname" error in blade service start up.
  • Introduced optional new systemd start up scripts for tractor-engine and tractor-blade, for use on RHEL7 systems for example. The scripts are shipped in /opt/pixar/Tractor-2.2/lib/SystemServices/systemd/. See the documentation section "Linux - with systemd".

Changes in 2.2 1593580

  • Fixed an issue where a task retry after job pause could allow that task and its successor to both become runnable concurrently after the job was unpaused.
  • Fixed engine logic that could generate spurious "retry successor task loop detected" log messages in some complex tractor Instance use cases.
  • The "rfm--maya-" environment block in the new stock shared.*.envkeys configuration files now includes additions that allow xgen procedurals to load correctly when batch rendering from Tractor. Copy the "shared.*.envkeys" files to your tractor config directory, or integrate similar changes into your "rfm" handler block if you have customized files.
  • Changed the shipped example task custom menu item to avoid confusion with a similar example in documentation.

Changes in 2.2 1557505

Features

  • The Python job authoring API now supports setting the job-level serialsubtasks and spoolcwd attributes.
  • The "Clear earlier blade data" operation in the Dashboard's blade list context menu causes the selected blades to be hidden from view until they register again. This operation now also causes an internal cache to recalculate the active task count and number of slots in use for the selected blades. This is useful should sites observe invalid values in the blade list.
  • Pressing 'Z' in the dashboard causes the view to scroll to the currently selected item.
  • A new command line operation, tq queuestats, displays internal queue information from the engine. This is useful in debugging engine backlogs under unusual load.
  • A new command line operation, tq dbreconnect, causes the engine to reestablish its database connections. This administrative operation may be useful in several unusual situations. For example, dbreconnect can reclaim accumulated system memory consumed by a bug in PostgreSQL when new large jobs are submitted.

Fixes

  • Fixed bug in which Dashboard would display incorrect task counts in job list.
  • Fixed bug in which the stoptime and process metrics of a command invocation may not be updated if the engine was restarted while the command was running.
  • Fixed bug in which a command invocation's current flag was not getting cleared if its task was retried while the command was running. This addresses multiple problems reported in the Dashboard, such as multiple blades reported for a command and unnecessary vertical spaces appearing in the job graph.
  • Fixed bug in tractor-spool in which using the --engine option with the default engine, namely tractor-engine:80, was not being observed if the TRACTOR_ENGINE environment variable was set.
  • Fixed a bug causing "linked.joblist" messages to appear in the engine log.
  • Fixed blade list and blade activity views so that selecting a blade in one view will cause the selected blade to become visible in the other view.
  • Fixed item lists so that when an out-of-viewport item is selected with the up or down arrow keys, the selected item will automatically be scrolled into view.
  • Fixed broken client-side search box by adding a check for null values.
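
The tractor-spool fix in the list above concerns precedence between the --engine option and the TRACTOR_ENGINE environment variable. The following is a minimal sketch of that precedence, not tractor-spool's actual code. The function name is hypothetical, and it assumes TRACTOR_ENGINE holds a host:port string.

```python
import os

DEFAULT_ENGINE = "tractor-engine:80"  # default engine named in the release note


def resolve_engine(cli_engine=None, environ=None):
    """Pick the engine address: explicit option, then environment, then default.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    if cli_engine is not None:
        # An explicit --engine value wins, even when it equals the default.
        return cli_engine
    return environ.get("TRACTOR_ENGINE", DEFAULT_ENGINE)
```

The key point is that "was the option given" is tested separately from "what value was given", so passing the default value explicitly still overrides the environment.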

Optimizations

  • Improved tq responsiveness through additional threads to handle query execution.
  • Optimized task skip operation, reducing database I/O and message payload to Dashboard.

Changes in 2.1 1496411

  • Dashboard Job Notes -- A new Notes field has been added to the Dashboard job details pane, allowing text annotations to be added to any job. Notes are visible to other users, and the presence of a note is indicated with a small "chat bubble" icon in the job list. These notes can be used to describe a problem to wranglers, or to explain why a job needs, or is getting, special handling. The engine will automatically add a note to a job when an attribute is changed through some user action, such as altering priority, so the notes become a history of changes to the job.

  • Dashboard Blade Notes -- A new Notes field has been added to the Dashboard blade details pane, allowing text annotations to be attached to a blade entry. These notes can be used by system administrators to describe known issues or to discuss ongoing admin work on a machine.

  • Dashboard Job Pins -- Individual jobs in each user's job list can now be "pinned" to the top of the list, independent of the global list sorting mode. Jobs might be pinned because they are important to track or just because they represent a current "working set" of jobs. The group of pinned jobs floats at the top of the list, and within that group the jobs are sorted according to the overall list sorting mode.

  • Dashboard Job Locks -- A single user can now "lock" a job from the Dashboard. A locked job can only be modified by the user who locked it. Locks are typically only used by wranglers who are investigating a problem and who want to prevent other users from changing, restarting, or deleting a job while the investigation is proceeding. The lock owner can unlock the job when done. Permission to apply a lock is controlled by the JobEditAccessPolicies "lock" attribute in crews.config.

  • Task Logs 'L' Hotkey -- When navigating the tasks within a job, the logs for the currently selected task can be displayed by pressing the 'L' key. The key is a toggle, so pressing 'L' again will close the currently open log.

  • User-centric Job Shuffle -- Individual users can re-order their own jobs on the queue without disrupting global priority settings. The dashboard job list option "Shuffle Job To Top" essentially exchanges the "place in line" of the selected job with a job submitted earlier from the same user, causing the selected job to run sooner than it would in the default submission order. This swap does not affect the ordering of other jobs on the queue, relative to the submission slots already held by that user. This slightly unusual feature is a simplified re-implementation of the old per-user dispatching order controls in Alfred, as requested by several customers. Permission to perform this kind of reordering is controlled by the JobEditAccessPolicies "jshuffle" attribute in crews.config.

  • The "project" affiliations for each job are now displayed in the job list view.

  • "Delete Job" action is now called "Archive Job" -- The former "Delete Job" menu item has been changed to "Archive Job" to better reflect its actual function: when the db.config setting "DBArchiving" is enabled, jobs that are removed from the active queue are transferred to an archive database where they can still be inspected and searched in tq queries. If DBArchiving is False, then "deleted" jobs are actually deleted and their database entries are removed -- in this case the dashboard menu item still says "Delete Job".

  • Archived Jobs View -- A Dashboard view of previously "deleted" (archived) jobs is now available. This view is analogous to a "trash can view" in some file browsers or e-mail clients. Jobs listed in the archive view can be browsed, and can also be restored to the main job queue where they can again be considered for dispatching. Note that jobs can sometimes contain "clean-up" commands that execute when they finish executing. These clean-ups may remove important temporary files that can make it impossible to re-execute that job.

  • Task progress bars for Nuke renders -- Tractor-blade now triggers a Dashboard progress bar update when it encounters a multi-frame progress message from Nuke, of the form "Frame 42 (7 of 9)".
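
As a rough illustration of the Nuke progress item above, the sketch below parses a message of that form and derives a percentage. It is not tractor-blade's code. The function name is hypothetical, and the release note does not say how the blade computes the percentage, so treating "7 of 9" as 7/9 complete is an assumption.

```python
import re

# Matches Nuke's multi-frame progress text, e.g. "Frame 42 (7 of 9)".
NUKE_PROGRESS = re.compile(r"Frame\s+(\d+)\s+\((\d+)\s+of\s+(\d+)\)")


def nuke_percent_done(line):
    """Return percent done (0-100) for a Nuke progress line, or None.

    Hypothetical helper; the done/total ratio is an assumed interpretation.
    """
    match = NUKE_PROGRESS.search(line)
    if match is None:
        return None
    done, total = int(match.group(2)), int(match.group(3))
    if total <= 0:
        return None
    return 100.0 * done / total
```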

  • Task Elapsed Time Bounds -- Job authors can now specify an acceptable elapsed time range for a given launched command. Commands whose elapsed time is outside the acceptable range will be marked as an error. Commands that run past the maximum time boundary will be killed. Example job script syntax:

    RemoteCmd {sleep 15} -service PixarRender -minrunsecs 5 -maxrunsecs 20
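
The bounds check described above can be sketched as follows. This is an illustrative model, not the blade's implementation. The function name and return labels are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def check_elapsed(elapsed_secs, minrunsecs=None, maxrunsecs=None):
    """Classify an elapsed time against optional -minrunsecs/-maxrunsecs bounds.

    Hypothetical helper: bounds are assumed inclusive, and None means "no bound".
    """
    if minrunsecs is not None and elapsed_secs < minrunsecs:
        return "too-short"  # finished before the minimum: marked as an error
    if maxrunsecs is not None and elapsed_secs > maxrunsecs:
        return "too-long"   # ran past the maximum: the command is killed
    return "ok"
```

With the job script line above, a 15 second sleep falls inside the 5 to 20 second range, so `check_elapsed(15, 5, 20)` yields `"ok"`.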
    
  • Per-Tier Scheduling -- A new extension to the DispatchTiers specification in tractor.config allows each defined tier to have its own scheduling mode. For example, the "rush" tier might be scheduled in a strict FIFO order, whereas the default mode might be one of the modes that favors shared-access (like P+ATCL+RR). Tiers can be assigned the new "P+CHKPT" mode to take advantage of the partial-graph looping feature in Tractor 2.0; tiers using that mode should be placed before tiers receiving "classic" non-checkpoint jobs.

  • Site-defined Task Log Filters -- A new FilterSubprocessOutputLine() method is now available as an advanced customization feature in the TractorSiteStatusFilter module. This method provides Python access to every line of task output. The site-written code can perform arbitrary actions in response to task output, and built-in Tractor-specific actions are also available. These include marking the task as an error, generating percent-done progress updates, initiating a task graph "expand" action, and stripping the output line from the logs.

  • GPU Detection -- On start-up, tractor-blade now attempts to enumerate any GPU devices installed on the blade host. The device model and vendor name "labels" are made available during the profile selection process so that groups of blades can be categorized by the presence or type of GPU, if desired. The "Hosts" dictionary in a blade.config profile definition defines the matching criteria for that profile. Two new optional keys are now available: the "MinNGPU" entry specifies the minimum number of GPU devices required for a match; and "GPU.label" specifies a wildcard-style matching string for a particular vendor/model. This label string also now appears in the Dashboard blade list, if a GPU device is found.
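
One way to picture the two new matching keys is the sketch below. It is not tractor-blade's matching code. The function name is hypothetical, the label strings in the usage note are made up, and the use of shell-style wildcards for "GPU.label" is an assumption based on the "wildcard-style" wording above.

```python
from fnmatch import fnmatchcase


def gpu_criteria_match(hosts, gpu_labels):
    """Check the GPU-related keys of a profile's "Hosts" dictionary.

    hosts      -- dict that may contain "MinNGPU" and/or "GPU.label"
    gpu_labels -- list of label strings for the GPUs detected on the blade
    Hypothetical helper; absent keys are treated as "no constraint".
    """
    min_ngpu = hosts.get("MinNGPU")
    if min_ngpu is not None and len(gpu_labels) < min_ngpu:
        return False
    pattern = hosts.get("GPU.label")
    if pattern is not None:
        # Assumed semantics: at least one detected GPU must match the pattern.
        if not any(fnmatchcase(label, pattern) for label in gpu_labels):
            return False
    return True
```

For example, `gpu_criteria_match({"MinNGPU": 1, "GPU.label": "*Quadro*"}, ["NVIDIA Quadro K5000"])` returns True, while the same blade fails a profile that asks for `"MinNGPU": 2`.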

  • The new tractor.config setting "CmdAutoRetryStopCodes" specifies a list of exit codes that will be considered "terminal" -- automatic retries will NOT be considered for commands that exit with these codes, unless the -retryrc list for a specific command requests it. Negative numbers represent unix signal values, and the codes 10110 and 10111 are generated when a command's elapsed time falls outside the new run-time bounds options, when given. The default no-retry stop codes are the values for SIGKILL (-9) and SIGTERM (-15), plus the two time-bounds codes:

    "CmdAutoRetryStopCodes": [-9, -15, 10110, 10111],
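
The interaction between the stop codes and a command's -retryrc list can be sketched as below. This models only the rule stated above, not the engine's full retry logic, and the function name is hypothetical.

```python
# Default from the setting above: SIGKILL, SIGTERM, and the two time-bounds codes.
DEFAULT_STOP_CODES = (-9, -15, 10110, 10111)


def auto_retry_blocked(exit_code, stop_codes=DEFAULT_STOP_CODES, retryrc=()):
    """True when the exit code is terminal and -retryrc does not override it.

    Hypothetical helper; whether a non-terminal code is actually retried
    depends on other retry settings that are not modeled here.
    """
    if exit_code in retryrc:
        # The command's own -retryrc list explicitly requests a retry.
        return False
    return exit_code in stop_codes
```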
    
  • Engine statistics query -- A new URL request (Tractor/monitor?q=statistics) has been added to help integrate tractor-engine performance metrics with other site-wide monitoring systems. The returned JSON object contains the most recent sample of several statistics that the engine collects about itself. This data might be used, for example, to populate an external site monitoring system. Some monitoring systems are able to make this URL request for data directly, while others may require a small data source script to be written that requests the JSON statistics report and then forwards each value of interest to the monitoring system separately.
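
A small data source script of the kind described above might look like the following sketch. The http scheme, the default host:port, the "tractor." metric prefix, and both function names are assumptions. The key names inside the JSON report are not listed here, so the code simply forwards every top-level numeric value it finds.

```python
import json
from urllib.request import urlopen


def fetch_statistics(engine="tractor-engine:80", timeout=10):
    """Request the engine's statistics report and decode it as JSON.

    The host:port default and the http scheme are assumptions.
    """
    url = "http://%s/Tractor/monitor?q=statistics" % engine
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def numeric_metrics(stats, prefix="tractor"):
    """Yield (name, value) pairs for each top-level numeric entry."""
    for key, value in sorted(stats.items()):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            yield "%s.%s" % (prefix, key), value
```

A site script would call `numeric_metrics(fetch_statistics())` on a timer and hand each pair to whatever submission call its monitoring system provides.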

  • Concurrent Expand Chunks -- This advanced expand task variant provides one approach to avoiding serial delays in jobs containing long-running single commands that produce a sequence of results needed by other tasks in the job. This new extension enables pipeline integrators to construct jobs that launch a long running command, such as a fluid simulation, and then concurrently launch another command, such as a render, when each sequential output file is generated by the first command. Thus rendering can proceed without waiting for all of the simulation steps to complete. This particular approach is well suited to cases where the simulation app is creating output files whose filenames are not known ahead of time, and thus the subsequent render command line arguments must be generated dynamically. The simulation, or a wrapper script, detects when the next step is complete, writes the appropriate rendering Task description into a temporary file, and then notifies tractor-blade by emitting the new 'TR_EXPAND_CHUNK "filename"\n' line on stdout. Tractor-blade will detect that directive in the application stdout stream and deliver the file contents to the engine. The new render task is inserted into the running job and can be dispatched immediately elsewhere on the farm. The blade will automatically remove the temporary file once it has been delivered.
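
The wrapper-script side of this handshake can be sketched as below. The helper names, the ".alf" suffix, and the example Task text (including the prman command and RIB filename) are illustrative assumptions. Only the TR_EXPAND_CHUNK directive itself comes from the description above.

```python
import os
import sys
import tempfile

# Illustrative task description; the render command and filename are made up.
EXAMPLE_TASK = (
    "Task {render frame 12} -cmds {\n"
    "    RemoteCmd {prman frame.0012.rib} -service PixarRender\n"
    "}\n"
)


def format_directive(path):
    """Build the stdout line that tells tractor-blade about a new chunk file."""
    return 'TR_EXPAND_CHUNK "%s"\n' % path


def emit_expand_chunk(task_text, out=None):
    """Write a task description to a temp file and announce it on stdout.

    Returns the temp file path. The file is deliberately left in place,
    since the blade removes it after delivering the contents to the engine.
    """
    out = sys.stdout if out is None else out
    fd, path = tempfile.mkstemp(suffix=".alf", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(task_text)
    out.write(format_directive(path))
    out.flush()  # make sure the blade sees the directive promptly
    return path
```

A simulation wrapper would call `emit_expand_chunk(...)` once per completed step, after the step's output file is fully written, so the render task never starts on a partial file.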

  • TR_EXIT_STATUS auto-terminate policy change -- the default behavior for the TR_EXIT_STATUS handler has now reverted to the 1.x and earlier 2.x behavior in which the status value is simply recorded and then reported later when the command actually exits. The more recent behavior in which the blade actively kills the app upon receipt of TR_EXIT_STATUS is still available, but it must be explicitly enabled in blade.config using the profile setting:

    "TR_EXIT_STATUS_terminate": 1,
    
  • Blade record visibility flag -- The Dashboard blade list display is created from database records describing each tractor-blade instance that has connected to the engine in the past. These records are retained, even when a blade host is no longer deployed, in order to correlate previously executed commands with the machine they ran on. The dashboard blade list menu item "Clear prior blade data" no longer removes the actual database record for the given blade. Instead it simply sets a flag that hides the record from display in the dashboard. The record (and its new unique id field) are now retained for correlation with old task records. The blade data items can be completely removed manually if they are truly unneeded.

  • Cookie-based Dashboard relogin -- A new policy allows auto-relogin to new Dashboard windows based on a saved session cookie, even when site passwords are enabled. The cookie contains only a session ID that is validated by the engine; it does not contain any password data itself. The older policy that denied auto-login when passwords are required can be restored by adding a "_nocookie" modifier to the crews.config SitePasswordValidator setting.

  • Added a new tractor-dbctl --set-job-counter option that sets the initial job ID value in a new job database. Job IDs start at 1 by default, so this ability to specify a different starting value can be helpful when starting from a fresh Tractor install in order to prevent overlaps between the job IDs from the new install and older jobs. Tractor upgrade installs that reuse the prior job database will continue to see job ID continuity.

  • Several internal improvements have been made to the job database upgrade procedure. Many code-related changes in new releases can now be applied without a significant database alteration, needing only an engine restart. Changes involving new database schema definitions are now applied with a system that better handles upgrades across multiple versions.

  • Overall throughput optimizations -- Various performance improvements have been made in this release, especially with regard to handling large numbers of simultaneous updates as many jobs complete or are deleted at the same time.
