1. Purpose
The objective of this plan is to provide the organization with a guideline for continuing to manage the business through MFGPro and for minimizing the disturbance to manufacturing operations in the event that the MFGPro system is completely unavailable and cannot be recovered within 48 hours.
The service provided during disaster recovery is of a survival nature: it may not extend to every user, and it may require end-users to re-enter backlog transactions both at the commencement of disaster recovery and at the restart of normal operation.
The recovery option will vary depending on the cause of the outage and may require the set-up of a LAN server, a secondary processing center, etc.
2. Scope
The MFGPro Disaster Recovery Procedure applies to the Company sites whenever the MFGPro system is completely unavailable to all users and recovery is estimated to take more than 48 hours to complete.
3. Owner
Company, IT.
4. Policy
To facilitate the proper execution of the MFGPro Disaster Recovery Procedure, the following policy guidelines should be taken into account.
- First of all, the MFGPro Disaster Recovery Procedure should be communicated to all personnel involved in its execution.
- Second, the actual procedure should be regularly (bi-annually) simulated/drilled according to the prespecified disaster recovery drill plan (see Appendices D to F). Disaster simulation reduces the opportunity for miscommunication when the plan is implemented during a real disaster. It also offers management an opportunity to spot weaknesses and improve procedures.
- Third, reciprocal agreements between Shanghai and HK must be in place, up to date, and verified.
- Fourth, the MFGPro Disaster Recovery Procedure should be periodically evaluated and adjusted to fit the constantly changing (IT) environment.
- Finally, the MFGPro Disaster Recovery Procedure should not be regarded as all-encompassing: actual disaster scenarios will likely differ from the cases described in this procedure and will require ad-hoc activities. Nevertheless, the procedure provides a general framework for properly addressing MFGPro-related disaster scenarios.
5. Roles and Responsibilities
Effective response to disaster scenarios requires a coordinated effort by all stakeholders. This implies that disaster recovery should be a team effort rather than an individual one. The following sections present an overview of the command structure and roles and responsibilities applicable to the Company site.
5.1. Command structure
Team | Members |
Disaster Recovery Management Team | |
Disaster Recovery Team | |
Disaster Recovery Manager | |
Operations Representatives | |
IS Application Team | |
IS System Team | |
5.2. Roles and responsibilities
Role | Responsibilities |
Disaster Recovery Management Team | |
Disaster Recovery Team | |
Disaster Recovery Manager | |
Operations Representatives | |
IS Application Team | |
IS System Team | |
5.3. Communication plan
The conceptual communication plan defines the flows that, once executed, will quickly inform the main participants of the disaster situation. Due to the currently high staff turnover rate at Company, it is not yet possible to assign actual names and telephone numbers to the plan.
6. Definition and Abbreviations
6.1. Definitions
Disaster: A disaster is defined here as the situation in which the complete MFGPro system is out of operation and recovery is estimated to require more than 48 hours, regardless of the cause of the problem (e.g. hardware, network, or software problems).
Disaster recovery: The process of recovering from a disaster by continuing the execution of services in another computer center at another location.
6.2. Abbreviations
N/A.
7. Procedure details
7.1. Procedure definition
This section describes the procedures for each of the following possible scenarios (as identified by IT) with corresponding recovery options (detailed steps for setting up the test server are provided in Appendix C).
- Power failure > 48 hours
- LAN outages > 48 hours
- Hardware failure > 48 hours
- WAN failure > 48 hours
- MFGPro system and/or Progress and/or operating system failure > 48 hours
- Total system failure > 48 hours
All of the above disaster scenarios are addressed using a standard protocol (see 7.2) that covers the disaster recovery kick-off, the specific recovery measures, and the restoration to prefailure status.
7.1.1. Power failure
Possible Incidents
In this scenario, the power supply to the Shanghai server room is interrupted. This affects the MFGPro application hosted on the production server in the Shanghai server room. Upon interruption, the power supply will automatically switch to the two UPS units available in the server room. The UPS units, however, can only supply power for a relatively short time (1-2 hours in our case).
Risk Assessment
Recovery Options
If the power supply to the MFGPro production server fails, the MFGPro system must be transferred to the test server located on the second floor of the Shanghai site. Users that interact with the MFGPro system must relocate to the location of the test/temporary server to resume critical tasks. If the entire Shanghai site is without power, Hong Kong will be requested to restore our system (see 7.1.6).
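Before users are redirected, it is advisable to confirm from a user workstation that the test server is reachable on the network. The commands below are a minimal sketch only; the hostname "sh2-testsrv" and the port number are placeholders that must be replaced with the actual values documented in Appendices A and B, and exact command options vary by operating system.
  # Placeholder hostname and port; substitute the actual test server details from Appendix A/B.
  ping sh2-testsrv            # confirm basic network reachability (interrupt with Ctrl-C)
  telnet sh2-testsrv 2500     # confirm the (placeholder) service port answers; quit with Ctrl-] then "quit"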
7.1.2. LAN outages
Possible Incidents
This scenario assumes the LAN to be unavailable due to failure of the physical lines.
Risk Assessment
LAN unavailability due to failure of physical carrier lines can be considered as a “High Risk” issue, as Shanghai does not exercise control over these lines and therefore can only take measures to accommodate the situation until the problem is solved by a third party organization.
Recovery Options
In the case of LAN unavailability due to failure of line “1”, Shanghai users will temporarily relocate to the Shanghai 2 office, where a direct connection to the system can be established. In the case of failure of line “2”, Shanghai users will temporarily relocate to the Shanghai 1 office. For scenario “C”, please refer to section 7.1.3. Since all Shanghai sites are located within convenient distance of each other, providing connection to the MFGPro system through dial-up should only be considered as a second option. Nonetheless, dial-up access should be considered for remote users in, for example, Hong Kong and Holland.
7.1.3. Hardware failure
Possible Incidents
This scenario assumes failure of the hardware of the production server that hosts the MFGPro system. This may be the result of a variety of causes such as fires, disk crashes, CPU errors, RAM failure, and even corporate theft.
Risk Assessment
This scenario can be considered “Medium Risk”, because this type of failure is not uncommon and spare parts may not always be available from vendors within the 48-hour timespan.
Recovery Options
In this scenario, assuming that spare parts are not available within 48 hours, the test server at SH2 will be configured to host the MFGPro system. MFGPro users’ workstations and desktops need to be reconfigured (i.e., change target IP) to ensure access to the test server. Moreover, remote users (e.g. those in Hong Kong, Japan, etc.) will be immediately informed of the new server address.
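In practice, this reconfiguration can often be limited to repointing the MFGPro server hostname to the test server's address, so that existing client shortcuts and scripts continue to work. The snippet below is an illustrative sketch only: the hostname "mfgpro-srv" and both IP addresses are placeholders, and Windows desktops would edit the equivalent hosts file (under system32\drivers\etc) instead of /etc/hosts.
  # Run as root on each UNIX client: back up the hosts file, then repoint the
  # MFGPro server name (placeholder values) to the SH2 test server address.
  cp /etc/hosts /etc/hosts.before-dr
  sed 's/^192\.168\.1\.10[[:space:]]*mfgpro-srv/192.168.2.20  mfgpro-srv/' /etc/hosts.before-dr > /etc/hosts
If clients connect by IP address rather than by hostname, each client configuration must be changed individually, which is why connecting by hostname is preferable.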
7.1.4. WAN failure
Possible Incidents
In this scenario, the connection from Shanghai to the WAN is lost due to hardware (e.g. a defective router) or carrier issues (e.g. problems with the physical line).
Risk Assessment
This case is rated as “Low Risk”, mainly because Company is able to continue its operations without major implications. However, users other than those in Shanghai (e.g. Hong Kong, Holland, etc.) will not be able to access the MFGPro system in Shanghai until the WAN connection is restored.
Recovery Options
A WAN failure only affects the external communication of Shanghai. In case of WAN failure, the users in Hong Kong and Shenzhen will be immediately informed of the situation. These users are expected to accommodate by delaying their transactions or by sending information through other channels of communication to Shanghai for local processing.
7.1.5. MFGPro system and/or Progress and/or operating system failure
Possible Incidents
This refers to the situation in which the MFGPro software becomes corrupted, regardless of the actual cause.
Risk Assessment
This is considered a relatively “Low Risk” problem, as the chances of recovery are very high: the system can be restored to its prefailure status from backup.
Recovery Options
If the MFGPro system on the production server fails and recovery is expected to require more than 48 hours, the test server at SH2 will be set up and configured to resume operations. MFGPro users’ workstations and desktops need to be reconfigured (i.e., change target IP) to ensure access to the test server. Furthermore, remote users (e.g. those in Hong Kong, Kobe, etc.) will be immediately informed of the new server address.
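Before loading the test server, the most recent full backup tape should be checked for readability so that no time is lost on a bad tape. A minimal sketch, using the tape device and data paths already documented in Appendix C:
  # List the tape contents without extracting anything; the output should show
  # files under /data/vol1 and /data/vol2 (tape device name as per Appendix C).
  tar tvf /dev/rmt/0m | head -20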
7.1.6. Total system failure
Possible Incidents
This scenario assumes the failure of both the production server and the test server, with no spare equipment available for setting up a replacement server. This case can be the result of a major natural disaster such as an earthquake or a fire.
Risk Assessment
This is considered a “Medium Risk” problem: the chances of such a scenario occurring are very slim, but if it does occur, our operations and business continuity will be affected in a way not comparable with the other scenarios.
Recovery Options
If both the Shanghai 1 production server and the Shanghai 2 test server are down, and setting up a replacement (UNIX) server is not possible, the disaster recovery manager will order the backup tape to be sent to Shenzhen (which forwards the tape to HK) for restoration. Using the backup tape (containing a full backup of our MFGPro application and settings), the MFGPro system will be restored on a server in Hong Kong. It must be noted, however, that Hong Kong can only provide us with a cold standby for restoring our system. MFGPro users’ workstations and desktops need to be reconfigured (i.e., change target IP) to ensure access to the recovery server.
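Because Hong Kong only provides a cold standby, the restoration there follows the same sequence as Appendix C once the tape has been physically delivered. The outline below is a sketch that assumes the Hong Kong server uses the same directory layout, tape device name, and install path as the Shanghai machines; this must be confirmed with Hong Kong IT beforehand.
  # On the Hong Kong cold-standby server (assumed to mirror the Shanghai layout):
  df -k /data                                   # confirm roughly 5 GB or more of free space
  tar xvf /dev/rmt/0m                           # restore the full MFGPro backup from the tape
  sh /mfgproce/85InstallDir/start.Conv8         # start the restored MFGPro/Progress services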
7.2. Procedure flow chart
TBD
7.3. Secondary processing center: user allocation
Once a secondary processing center (e.g. test or temporary server) has been set up, access to the system must be allocated to a limited number of local and remote users so that they can process transactions while safeguarding system stability. The following access distribution is proposed by the disaster recovery management team.
Department | SH | HK | SZ |
Order Desk | 1 | 1 | |
PMC | 2 | | |
Store/Shipping | 3 | | 1 |
Production | 4 | | |
Purchasing | 1 | | |
Logistics | 1 | | |
F&A | 10 | | |
IT | 2 | | |
Total | 24 | 1 | 1 |
8. References
N/A
9. APPENDIX A: MFGPRO NETWORK SETUP
TBD
10. APPENDIX B: MFGPRO SERVER SPECIFICATIONS
TBD
11. APPENDIX C: MFGPRO DISASTER RECOVERY DETAILS
1. Checking the backup machine environment setup:
The backup machine is normally assigned to MFGPro customization program development. Before proceeding with the set-up of the secondary MFGPro processing center, the following settings should be confirmed (a consolidated command sketch covering steps 1d to 4 is given at the end of this appendix):
- a. Confirm that the MFGPro programs match those residing on the production machine.
- b. Activate the temporary MFGPro user IDs according to the following list:
  - Department IDs:
    - Order Desk –> rs50sun
    - PMC –> pmc50sun, pmc60sun, pmc55sun
    - Store –> wh50sun1, wh50sun2, wh50sun3, wh55sun1, wh55sun2
    - Production –> mo50sun, mo60sun1, mo60sun2, mo60sun3, mo65sun1, mo65sun2, mo65sun3, mo65sun4
    - Purchasing –> pur50sun
    - Logistics –> pmc70sun
    - F&A –> fa50sun, fa80sun, fa90sun
    - IT –> its0test
- c. Disable the testing MFGPro user ID, “its85”.
- d. Check the disk space of the “/data” directory (around 5 Gbyte free or more);
2. Reload the MFGPro data using the command: “tar xvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*”;
3. Start the server using the command: “sh /mfgproce/85InstallDir/start.Conv8”;
4. Test the recovered MFGPro system:
Log in using the ID “its0test”, then check the last inventory transaction date and the last feedback transaction date in order to confirm the last update date of the recovered system.
5. Inform the production dept. that the recovered MFGPro system is online;
6. When the production machine has been fixed and returned to its normal operational status, stop the recovered MFGPro system and make a backup of it:
a. Run the command: “sh /mfgproce/85InstallDir/stop.Conv8”.
b. Backup the database: “tar cvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*”;
7. Transfer the MFGPro database backup from the backup machine to the production machine by restoring the backup tape on the production machine, using the command: “tar xvf /dev/rmt/0m /data”;
8. Start the MFGPro database and test the reloaded MFGPro database;
9. Inform production that the MFGPro system is back to normal status and re-capture the transactions received during the recovery period.
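For drill purposes, steps 1d to 4 above can be collected into a short script on the backup machine. This is a sketch only: it simply strings together the commands already quoted in this appendix and assumes the backup tape is loaded in /dev/rmt/0m.
  #!/bin/sh
  # Sketch of the restore sequence on the backup machine (steps 1d to 4 above).
  df -k /data                                            # 1d. confirm around 5 GB or more free
  tar xvf /dev/rmt/0m /data/vol1/*.* /data/vol2/*.*      # 2. reload the MFGPro data from tape
  sh /mfgproce/85InstallDir/start.Conv8                  # 3. start the MFGPro server
  # 4. Log in as its0test and verify the last inventory and feedback transaction dates manually.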
12. APPENDIX D: MFGPRO DISASTER DRILL EXERCISES
Drill Exercise | Applicable to Scenarios | Drill Instructions |
1. Move MFGPro system to test server at SH2 | Power failure, Hardware failure, MFGPro system failure | |
2. Relocate users to allow for direct connection to MFGPro system at SH1 | LAN failure | |
3. Temporarily exclude remote users from accessing MFGPro system at SH1 | WAN failure | |
4. Send backup tape (by express mail or courier) to Shenzhen for restoration | Total system failure | |
5. Set up dial-up connection to access MFGPro system | All scenarios | |
13. APPENDIX E: MFGPRO DISASTER DRILL SCHEDULE
TBD
14. APPENDIX F: MFGPRO DISASTER DRILL FORM
TBD