Every effort has been made to ensure that an IncrediBuild environment remains highly reliable and
available, and that it withstands scenarios such as network disconnects or server/client nodes
becoming unavailable during the execution of a distributed job.
Agent Disconnect Recovery
XGE (Xoreax Grid Engine) technology utilizes a transactional model to prevent incomplete
execution of build tasks. Accordingly, if an Agent executing a remote task becomes unavailable
(for any reason) and is unable to complete the task execution or to send back its output, any
output files created during this task's execution are discarded and the task is assigned to
another Agent. The integrity of the distributed job is thus fully preserved.
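
The transactional behavior described above can be illustrated with a minimal sketch. This is not XGE's actual implementation or API: the names `run_task_transactionally` and `AgentUnavailable`, and the idea of modeling each Agent as a callable that writes into a staging directory, are assumptions made purely for illustration. Output is committed only after a task finishes completely; otherwise it is discarded and the task goes to the next Agent.

```python
import shutil
import tempfile
from pathlib import Path


class AgentUnavailable(Exception):
    """Hypothetical signal that an Agent dropped out before finishing a task."""


def run_task_transactionally(task, agents, output_dir):
    """Try each Agent in turn; commit output only after a complete run.

    Each Agent is modeled (as an assumption) as a callable taking the task
    and a private staging directory to write its output files into.
    """
    output_dir = Path(output_dir)
    for agent in agents:
        staging = Path(tempfile.mkdtemp(prefix="staging-"))
        try:
            agent(task, staging)  # may write partial files and then fail
        except AgentUnavailable:
            continue  # partial output is thrown away in the finally clause
        else:
            # The task completed, so its output files are committed.
            output_dir.mkdir(parents=True, exist_ok=True)
            for produced in staging.iterdir():
                shutil.move(str(produced), str(output_dir / produced.name))
            return agent
        finally:
            shutil.rmtree(staging, ignore_errors=True)
    raise RuntimeError(f"no Agent could complete task {task!r}")
```

Because every attempt writes into its own staging directory, a failed Agent can never leave half-written files in the job's real output location.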
Dynamic Resource Assignment
In the event that an Agent becomes unavailable during a distributed job's execution, the job will
not simply "lose" a computing resource. Taking into account currently running jobs along with the
connected Agents' processing power and availability, the Coordinator may dynamically re-assign
Agents to running jobs so that every job uses the best available set of resources.
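
As a rough illustration of dynamic re-assignment, the sketch below recomputes the Agent-to-job mapping whenever availability changes. The Coordinator's real assignment policy is not described here, so the greedy balancing rule, the function name `assign_agents`, and the `(power, available)` tuple layout are all assumptions chosen for the example.

```python
def assign_agents(jobs, agents):
    """Return {job: [agent names]}, balancing total processing power per job.

    `agents` maps an Agent name to a (power, available) tuple. The greedy
    rule used here is a stand-in, not the Coordinator's actual policy.
    """
    assignment = {job: [] for job in jobs}
    load = {job: 0 for job in jobs}
    if not jobs:
        return assignment
    # Consider only available Agents, strongest first.
    usable = sorted(
        ((power, name) for name, (power, available) in agents.items() if available),
        reverse=True,
    )
    for power, name in usable:
        # Give the next Agent to the job with the least power assigned so far.
        target = min(jobs, key=lambda job: load[job])
        assignment[target].append(name)
        load[target] += power
    return assignment
```

Calling `assign_agents` again after an Agent is marked unavailable redistributes the remaining Agents, so the job that lost an Agent receives a replacement instead of simply running with fewer resources.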
Backup Coordinator
Since IncrediBuild uses the central "Coordinator" component to handle resource assignment, it is
crucial for the system to remain operational even if the Coordinator becomes unavailable. To
achieve this, a Backup Coordinator may be set up. The Backup Coordinator assumes control
whenever the primary Coordinator becomes unavailable for any reason, alerting users to the
condition but otherwise maintaining all system functionality. Once the primary Coordinator is
restored, normal operation resumes.
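
The failover rule in this paragraph can be sketched as follows. The `Coordinator` class, the `active_coordinator` function, and the alert list are hypothetical names introduced for illustration only; they do not reflect IncrediBuild's configuration or API.

```python
class Coordinator:
    """Hypothetical stand-in for a Coordinator and its reachability state."""

    def __init__(self, name):
        self.name = name
        self.available = True


def active_coordinator(primary, backup, alerts):
    """Return the Coordinator that should handle requests right now.

    The primary is always preferred. The backup is used only while the
    primary is unavailable, and an alert is recorded when that happens.
    """
    if primary.available:
        return primary
    if backup.available:
        alerts.append(f"{primary.name} unavailable; {backup.name} has taken over")
        return backup
    raise RuntimeError("no Coordinator available")
```

Because the primary is checked first on every call, control returns to it automatically as soon as it is restored.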