1.        Purpose

The objective of this plan is to provide a guideline for the organization to continue managing the business through MFGPro and to minimize disruption to manufacturing operations in the event that the MFGPro system becomes totally unavailable and cannot be recovered within 48 hours.

The service provided during disaster recovery is of a survival nature: it may not extend to every user, and end users may be required to re-enter backlog transactions at the commencement of disaster recovery and again when normal operation resumes.

The recovery option will vary depending on the cause of the outage and may require the set-up of a LAN server, a secondary processing center, etc.

 

2.        Scope

The MFGPro Disaster Recovery Procedure is applicable to the Company sites in cases in which the MFGPro system is completely unavailable to all of the users and recovery is estimated to take more than 48 hours to complete.

 

3.        Owner

Company, IT.

 

4.        Policy

To facilitate the proper execution of the MFGPro Disaster Recovery Procedure, the following policy guidelines should be taken into account.

  • First of all, the MFGPro Disaster Recovery Procedure should be communicated to all personnel involved in it.
  • Second, the actual procedure should be regularly (bi-annually) simulated/drilled according to the prespecified disaster recovery drill plan (see Appendices D and E). Disaster simulation reduces the opportunity for miscommunication when the plan is implemented during a real disaster. It also offers management an opportunity to spot weaknesses and improve procedures.
  • Third, reciprocal agreements between Shanghai and HK must be in place, up-to-date, and verified.
  • Fourth, the MFGPro Disaster Recovery Procedure should be periodically evaluated and adjusted to fit the constantly changing IT environment.
  • Finally, the MFGPro Disaster Recovery Procedure should not be regarded as fully comprehensive, as actual disaster scenarios will likely differ from the cases described in this procedure and will require ad-hoc activities. Nevertheless, it provides a general framework for properly addressing MFGPro-related disaster scenarios.

 

5.        Roles and Responsibilities

Effective response to disaster scenarios requires a coordinated effort by all stakeholders. This implies that disaster recovery should be a team effort rather than an individual one. The following sections present an overview of the command structure and roles and responsibilities applicable to the Company site.

5.1.    Command structure

The command structure comprises the following teams and members:
Disaster Recovery Management Team
  • Site Operation Manager
  • Site Manufacturing Manager
  • Site Financial Controller
  • Site IS Manager
Disaster Recovery Team
  • Disaster Recovery Manager (Team Representative)
  • Operations Representatives
  • F&A Representatives
  • IS Representatives (including Hong Kong, Shanghai and Shenzhen)
Disaster Recovery Manager
  • IS Manager (Primary)
  • IS Services Manager (Backup)
Operations Representatives
  • Factory Representative
  • Logistics System Manager
IS Application Team
  • MFGPro Application Specialist (Primary Representative)
  • CAD/CAM Application Specialist
IS System Team
  • IS System Team Members

5.2. Roles and responsibilities

The responsibilities of each role are as follows:
Disaster Recovery Management Team
  • To act as the owner of the recovery operation.
  • To decide whether or not to initiate the MFGPro Disaster Recovery Procedure (kick-off to “disaster recovery mode”).
  • To inform the organization of the disaster and to arrange the recovery actions.
  • To determine who receives priority regarding system access.
  • To regularly receive updates from the disaster recovery manager and to make decisions regarding resource allocation and additional required financial funding.
  • To ensure that disaster simulations are regularly performed.
  • To evaluate and adjust the MFGPro Disaster Recovery Procedure based on the results obtained from simulating disasters.
Disaster Recovery Team
  • To assist and consult the disaster recovery manager in his activities.
Disaster Recovery Manager
  • To act as the focal point in the organization and to coordinate with the various departments/functions.
  • To manage the team’s recovery operation in order to assure rapid, smooth, and effective response.
  • To conduct daily control meetings with all of the sub-teams aimed at reviewing issues and assigning action items.
  • To report daily to the disaster recovery management team on key issues affecting the continuity of business operations.
  • To plan, schedule, and execute disaster simulation.
Operations Representatives
  • To coordinate with the individual user departments in order to ensure an orderly switch to “disaster recovery mode”.
  • To work with the IS application and system teams in conducting simple testing on the setup of the backup server.
  • To coordinate with the individual user departments in capturing backlog transactions and performing tests should this be needed.
  • To work with the individual user departments in preparing a document outlining how business transactions are handled in “disaster recovery mode”.
IS Application Team
  • To ensure recovery of the system and database.
  • To provide application support to the disaster recovery team.
  • To participate in the execution of disaster recovery drill.
IS System Team
  • To ensure hardware (server and PC) and network recovery.
  • To ensure operating system recovery.
  • To provide technical support to the disaster recovery team.
  • To participate in the execution of disaster recovery drill.

5.3.    Communication plan

The conceptual communication plan defines the notification flows that, once executed, will quickly inform the main participants of the disaster situation. Due to the currently high turnover rate at Company, it is not (yet) possible to assign actual names and telephone numbers to the plan.

6.        Definition and Abbreviations

6.1.     Definitions

Disaster:                      A disaster here can be defined as the situation in which the complete MFGPro system is out of operation and for which recovery is estimated to require more than 48 hours, regardless of the cause of the problem (e.g. hardware, network, or software problems).

Disaster recovery:      The process of recovering from a disaster by continuing the execution of services in another computer center at another location.

6.2.    Abbreviations

N/A.

7.        Procedure details

7.1.    Procedure definition

This section describes the procedures for each of the following possible scenarios (as identified by IT), with corresponding recovery options (detailed steps for setting up the test server are provided in Appendix C).

  • Power failure > 48 hours
  • LAN outages > 48 hours
  • Hardware failure > 48 hours
  • WAN failure > 48 hours
  • MFGPro system and/or Progress and/or operating system failure > 48 hours
  • Total system failure > 48 hours

All of the above disaster scenarios are addressed using a standard protocol (see 7.2) that covers the disaster recovery kick-off, the specific recovery measures, and the restoration to pre-failure status.

7.1.1.     Power failure

Possible Incidents

In this scenario, the power supply to the Shanghai server room is interrupted. This affects the MFGPro application hosted on the production server in the Shanghai server room. Upon interruption, the power supply will automatically switch to the two UPS units available in the server room. The UPS units, however, can only supply power for a relatively short time (1-2 hours in our case).
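Before the UPS batteries are exhausted, the production MFGPro service should therefore be brought down cleanly. A minimal sketch is given below; it reuses the stop script referenced in Appendix C, and the OS halt step is a placeholder that must follow the site's own server shutdown procedure.

    #!/bin/sh
    # Minimal sketch: stop MFGPro cleanly while the server is still on UPS power.
    # The stop script path is taken from Appendix C; the halt command is a
    # placeholder for the site's own server shutdown procedure.
    sh /mfgproce/85InstallDir/stop.Conv8     # stop the MFGPro/Progress service
    # shutdown -h now                        # placeholder: halt the server per local procedure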

Risk Assessment

Recovery Options

If the power supply to the MFGPro production server fails, the MFGPro system must be transferred to the test server located in Shanghai, second floor. Users who interact with the MFGPro system must adapt by moving to the location of the test/temporary server to resume critical tasks. If Shanghai is also out of power, Hong Kong will be requested to restore our system (see 7.1.6).

7.1.2.        LAN outages

Possible Incidents

This scenario assumes the LAN to be unavailable due to failure of the physical lines.

Risk Assessment

LAN unavailability due to failure of physical carrier lines can be considered as a “High Risk” issue, as Shanghai does not exercise control over these lines and therefore can only take measures to accommodate the situation until the problem is solved by a third party organization.

Recovery Options

In the case of LAN unavailability due to failure of line “1”, Shanghai users will temporarily relocate to the Shanghai 2 office, where a direct connection to the system can be established. In the case of failure of line “2”, Shanghai users will temporarily relocate to the Shanghai 1 office. For scenario “C”, please refer to section 7.1.3. Since all Shanghai sites are located within convenient distance of each other, providing connection to the MFGPro system through dial-up should only be considered as a second option. Nonetheless, dial-up access should be considered for remote users in, for example, Hong Kong and Holland.

7.1.3.     Hardware failure

Possible Incidents

This scenario assumes failure of the hardware of the production server that hosts the MFGPro system. This may be the result of a variety of causes such as fires, disk crashes, CPU errors, RAM failure, and even corporate theft.

Risk Assessment

This scenario can be considered “Medium Risk”, as this type of failure is not uncommon. Moreover, spare parts may not always be available from vendors within the 48-hour timespan.

Recovery Options

In this scenario, assuming that spare parts are not available within 48 hours, the test server at SH2 will be configured to host the MFGPro system. MFGPro users’ workstations and desktops need to be reconfigured (i.e., the target IP must be changed) to ensure access to the test server. Moreover, remote users (e.g. those in Hong Kong, Japan, etc.) will be immediately informed of the new server address.
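As an illustration only, repointing a UNIX workstation could amount to updating the hosts entry used by its MFGPro session. The host alias "mfgpro" and the IP address below are placeholders, not values from this plan; Windows PCs would need the equivalent change in their own hosts file or terminal emulator settings.

    #!/bin/sh
    # Hedged sketch: repoint a workstation's MFGPro host alias to the SH2 test
    # server. The alias "mfgpro" and the address 10.0.2.10 are placeholders only.
    NEW_IP="10.0.2.10"                                    # placeholder: SH2 test server
    HOSTS_FILE="/etc/hosts"

    cp "$HOSTS_FILE" "$HOSTS_FILE.pre_dr"                 # keep the original for roll-back
    grep -v "mfgpro" "$HOSTS_FILE.pre_dr" > "$HOSTS_FILE" # drop the old production entry
    echo "$NEW_IP  mfgpro" >> "$HOSTS_FILE"               # point the alias at the test server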

7.1.4.     WAN failure

Possible Incidents

In this scenario, the connection from Shanghai to the WAN is lost due to hardware issues (e.g. a defective router) or carrier issues (e.g. problems with the physical line).

Risk Assessment

This case is rated as “Low Risk”, mainly because Company is able to continue its local operations without implications. However, users outside Shanghai (e.g. in Hong Kong, Holland, etc.) will not be able to access the MFGPro system in Shanghai until the WAN connection is restored.

Recovery Options

WAN failure only affects Shanghai’s external communication. In case of WAN failure, the users in Hong Kong and Shenzhen will be immediately informed of the situation. These users are expected to adapt by delaying their transactions or by sending information through other channels of communication to Shanghai for local processing.

7.1.5.     MFGPro system and/or Progress and/or operating system failure

Possible Incidents

This refers to the situation in which the MFGPro software becomes corrupted, regardless of the actual cause.

Risk Assessment

This is considered a relatively “Low Risk” problem, as the chances of recovery are very high: the system can be restored to its pre-failure status from backup.

Recovery Options

If the MFGPro system on the production server fails and recovery is expected to require more than 48 hours, the test server at SH2 will be set up and configured to resume operations. MFGPro users’ workstations and desktops need to be reconfigured (i.e., the target IP must be changed) to ensure access to the test server. Furthermore, remote users (e.g. those in Hong Kong, Kobe, etc.) will be immediately informed of the new server address.

7.1.6.     Total system failure

Possible Incidents

This scenario assumes the failure of both the production server and the test server, with no spare equipment available for setting up a replacement server. This case can be the result of a major natural disaster such as an earthquake or fire.

Risk Assessment

This is considered a “Medium Risk” problem: the chances of such a scenario occurring are very slim, but if it does occur, our operations and business continuity will be affected to a degree not comparable with the other scenarios.

Recovery Options

If both the Shanghai 1 production server and the Shanghai 2 test server are down, and setting up a replacement server (UNIX!) is not possible, the disaster recovery manager will order the backup tape to be sent to Shenzhen (which will forward the tape to HK) for restoration. Using the backup tape (containing a full backup of our MFGPro application and settings), the MFGPro system will be restored on a server in Hong Kong. It must be noted, however, that Hong Kong can only provide us with a cold standby for restoring our system. MFGPro users’ workstations and desktops need to be reconfigured (i.e., the target IP must be changed) to ensure access to the recovery server.

7.2.    Procedure flow chart

TBD

7.3.    Secondary processing center: user allocation

Once a secondary processing center (e.g. test or temporary server) has been set up, access to the system must be allocated to a limited number of local and remote users so that they can process transactions while safeguarding system stability. The following access distribution is proposed by the disaster recovery management team.

  Department          SH    HK    SZ
  Order Desk           1     1
  PMC                  2
  Store/Shipping       3           1
  Production           4
  Purchasing           1
  Logistics            1
  F&A                 10
  IT                   2
  Total               24     1     1

8.        References

N/A

 

9.        APPENDIX A: MFGPRO NETWORK SETUP

TBD

 

10.    APPENDIX B: MFGPRO SERVER SPECIFICATIONS

TBD

 

11.    APPENDIX C: MFGPRO DISASTER RECOVERY DETAILS

1. Check the backup machine environment setup:
The backup machine is normally assigned to MFGPro customization program development. Before proceeding with the setup of the secondary MFGPro processing center, the following settings should be confirmed (a shell sketch of these checks follows the list):

  • a. The MFGPro programs are similar to those residing on the production machine.
  • b. Activate the temporary MFGPro user IDs according to the following list of department IDs:
       - Order Desk: rs50sun
       - PMC: pmc50sun, pmc60sun, pmc55sun
       - Store: wh50sun1, wh50sun2, wh50sun3, wh55sun1, wh55sun2
       - Production: mo50sun, mo60sun1, mo60sun2, mo60sun3, mo65sun1, mo65sun2, mo65sun3, mo65sun4
       - Purchasing: pur50sun
       - Logistics: pmc70sun
       - F&A: fa50sun, fa80sun, fa90sun
       - IT: its0test
  • c. Disable the testing MFGPro user ID "its85".
  • d. Check the disk space of the "/data" directory (around 5 GB free or more).
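A minimal shell sketch of checks b through d is shown below. The user IDs and the /data path are taken from this appendix; treating the IDs as ordinary UNIX accounts in /etc/passwd is an assumption, as is the use of "df -k", which may need adjusting to the backup machine's UNIX flavour.

    #!/bin/sh
    # Hedged sketch of checks b-d from step 1.

    # d. Confirm that /data has roughly 5 GB (about 5,000,000 KB) or more free.
    df -k /data

    # b. Confirm that each temporary MFGPro user ID exists on the backup machine.
    for id in rs50sun pmc50sun pmc60sun pmc55sun \
              wh50sun1 wh50sun2 wh50sun3 wh55sun1 wh55sun2 \
              mo50sun mo60sun1 mo60sun2 mo60sun3 \
              mo65sun1 mo65sun2 mo65sun3 mo65sun4 \
              pur50sun pmc70sun fa50sun fa80sun fa90sun its0test
    do
        grep "^$id:" /etc/passwd > /dev/null || echo "Missing user ID: $id"
    done

    # c. The testing ID its85 should be disabled; flag it if it is still present.
    grep "^its85:" /etc/passwd > /dev/null && echo "Reminder: disable its85"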

2. Reload the MFGPro data using the command: “tar xvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*”;

3. Start the server using the command: “sh /mfgproce/85InstallDir/start.Conv8”;
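For convenience, steps 2 and 3 could be run as one small restore script. A minimal sketch, quoting the tape device, paths, and start script from the two steps above, is shown here; the commands are not re-verified beyond what this appendix states.

    #!/bin/sh
    # Minimal sketch combining steps 2 and 3; commands are quoted from this appendix.
    set -e                                              # stop on the first failure

    # Step 2: reload the MFGPro data from the backup tape
    tar xvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*

    # Step 3: start the MFGPro server
    sh /mfgproce/85InstallDir/start.Conv8

    echo "Restore finished - verify per step 4 using the its0test ID."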

4. Test the recovered MFGPro system:
Log in using the ID "its0test", then check the last inventory transaction date and last feedback transaction date in order to confirm the last update date of the recovered system.

5. Inform the production dept. that the recovered MFGPro system is online;

6. When the production machine has been fixed and has resumed its normal operational status, stop the recovered MFGPro system and make a backup of it:
a. Run the command: “sh /mfgproce/85InstallDir/stop.Conv8”.
b. Backup the database: “tar cvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*”;
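Steps 6a and 6b can likewise be run as one small script on the backup machine; a minimal sketch using the commands above:

    #!/bin/sh
    # Minimal sketch combining steps 6a and 6b on the backup machine; the stop
    # and tar commands are quoted from this appendix.
    set -e

    # 6a. Stop the recovered MFGPro system
    sh /mfgproce/85InstallDir/stop.Conv8

    # 6b. Back up the database to tape for transfer to the production machine (step 7)
    tar cvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*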

7. Transfer the backed-up MFGPro database from the backup machine to the production machine using the command: "tar xvf /dev/rmt/0m /data";

8. Start the MFGPro database and test the reloaded MFGPro database;

9. Inform production that the MFGPro system is back to normal status and re-capture the transactions received during the recovery period.

 

12.    APPENDIX D: MFGPRO DISASTER DRILL EXERCISES

Drill Exercise 1: Move MFGPro system to test server at SH2
Applicable scenarios: Power failure, Hardware failure, MFGPro system failure
Drill instructions:
  1. Configure MFGPro system according to Appendix C.
  2. Relocate random users to SH2.
  3. Establish direct connection to the system for these users.
  4. Allow users to interact with the system for 30 minutes.
  5. Record test results (Appendix F).

Drill Exercise 2: Relocate users to allow for direct connection to MFGPro system at SH1
Applicable scenarios: LAN failure
Drill instructions:
  1. Relocate random users to space near the SH1 MFGPro system.
  2. Establish direct connection to the system for these users.
  3. Allow users to interact with the system for 30 minutes.
  4. Record test results (Appendix F).

Drill Exercise 3: Temporarily exclude remote users from accessing MFGPro system at SH1
Applicable scenarios: WAN failure
Drill instructions:
  1. Inform remote users of the disaster test date and time.
  2. Configure MFGPro system to exclude remote users.
  3. Request remote users to send in transactions through other channels for local processing during the disaster testing time.
  4. Record test results (Appendix F).
  5. Follow up with user evaluation.

Drill Exercise 4: Send backup tape (by express or courier) to Shenzhen for restoration
Applicable scenarios: Total system failure
Drill instructions:
  1. Send backup tape to Shenzhen by express or courier.
  2. Contact Hong Kong to restore the system on a server.
  3. Obtain user IDs from Hong Kong to access the system.
  4. Reconfigure users’ computers to access the server in Hong Kong.
  5. Record test results (Appendix F).
  6. Follow up with user evaluation.

Drill Exercise 5: Set up dial-up connection to access MFGPro system
Applicable scenarios: All scenarios
Drill instructions:
  1. Ensure dial-up capability on users’ computers.
  2. Access MFGPro system through dial-up.

 

13.    APPENDIX E: MFGPRO DISASTER DRILL SCHEDULE

TBD

14.    APPENDIX F: MFGPRO DISASTER DRILL FORM

TBD