Procedures
Problem Resolution Procedures - Triage
Procedure:
1. Identify the problem.
   A. Assign a key leader to the problem.
   B. How was the problem reported (Service Desk, Ops, user, Service Request …)?
   C. Was the problem correctly filtered before it got here?
   D. Note where questioning improvements can be made.
   E. Does the problem belong to us? If not, report the problem to the appropriate group to handle and enter or edit the Service Request ticket.
   F. What are the symptoms, errors, messages…?
   G. Identify the most likely cause(s).
   H. Determine the resources and personnel required to work on the problem. The key leader gets these people involved.
   I. Determine the level of severity:
      1. Problem can be worked on at our convenience.
      2. Problem can be solved in 8-12 hours.
      3. Problem can be solved in 4-8 hours.
      4. Problem must be worked on immediately.
2. Notify contacts.
   A. Key leader determines who needs to be informed and delegates notification tasks.
   B. Enter a Service Request if not already entered.
   C. Give contacts information on symptoms, impact, and time estimates to fix.
   D. Inform the Service Desk and Operations; give them two things:
      1. Information in (C) above.
      2. A “message” they can give to users.
   E. Notify management; the key leader does this.
   F. Enter status in the http://systemstatus.calpoly.edu Web page.
3. Work on the problem.
   A. Key leader delegates tasks to team members if necessary.
   B. Key leader reports progress to supervisor.
   C. Key leader reports status to the Service Desk and updates the http://systemstatus.calpoly.edu Web page as appropriate.
4. Problem Fixed:
   A. Key leader determines who needs to be informed and delegates notification tasks.
   B. Update Service Request ticket(s).
   C. Update the http://systemstatus.calpoly.edu Web page.
5. Formal Problem Review:
   A. Review the problem: how, why, when, and what happened (weekly staff meeting).
   B. How can it be prevented from happening again?
   C. Did the reporting group properly filter the call and ask the right questions?
   D. Review the solution and the procedures taken; could it have been done better or more easily?
   E. Was or is it a personnel problem? If so, management must deal with it.
   F. Key leader delegates the updating or creating of documentation on the problem or the fix.
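The four severity levels in the triage procedure above could be recorded in a ticketing script or tracking tool along the following lines. This is a minimal sketch only: the names Severity, RESOLUTION_WINDOW_HOURS, and resolution_window are hypothetical, and the numeric ordering (1 = least urgent) is an assumption rather than part of the procedure.

```python
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    """The four severity levels from the triage step (ordering is assumed)."""
    AT_CONVENIENCE = 1    # problem can be worked on at our convenience
    WITHIN_8_TO_12H = 2   # problem can be solved in 8-12 hours
    WITHIN_4_TO_8H = 3    # problem can be solved in 4-8 hours
    IMMEDIATE = 4         # problem must be worked on immediately


# (min, max) hours, only for the levels where the procedure states a window.
RESOLUTION_WINDOW_HOURS = {
    Severity.WITHIN_8_TO_12H: (8, 12),
    Severity.WITHIN_4_TO_8H: (4, 8),
}


def resolution_window(level: Severity) -> Optional[Tuple[int, int]]:
    """Return the (min, max) hours for a level, or None if no window is stated."""
    return RESOLUTION_WINDOW_HOURS.get(level)
```

A key leader's status update could then carry the level alongside the time estimate, for example resolution_window(Severity.WITHIN_4_TO_8H) for a problem expected to take 4-8 hours.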
Outage Notifications for Services using ITS Resources
Our users depend on the availability of the technology provided to them by ITS in order to do their jobs and accomplish the mission of the University. If resources such as course web pages, the network, email, printing, the mainframe, servers, or other user access methods are not available, work is impeded and efficiency is lost.
Therefore, it is our responsibility to notify our users of any interruption, under the control or management of ITS, that may affect their ability to do their work.
This document is a tool to help ITS standardize our notification procedures. It is expected that Service Owners will use this document to help them create outage notification procedures for their users or for the services they provide. This document does not address how the problem will be solved or resolved, only how we will let users know what’s going on.
Reasons for Notifying Users
Maintenance
System failures (hardware/software, human error)
Backups that make data or systems unavailable
Hardware or software upgrades
Patches
Network failures
System/data migration
New implementations/changes
Holidays/support issues
Viruses
Anything affecting use of a system, function, application, server, or system utility
What is a service?
A service could be a function, application, or process whose availability and/or access is the responsibility of an ITS department or group. If the service is not available, users may not be able to complete work or assignments, or gain access to certain systems.
Each campus service that could be affected by an outage of ITS Resources must have written user notification procedures.
There are two steps involved with these procedures:
Creating the notification procedures for the service
Executing the notification procedures for the service
Process or Procedure for Notifying Users
Each service must have procedures in place to respond to outages. Following is a checklist to help build those procedures.
1. Identify the service owner.
   - The group that has the main responsibility for the system(s), application(s), or function(s) that are or may be affected is also responsible for creating and executing the user notification procedures.
2. Identify what is affected by the outage (e.g. if the service is an application, then most likely only access to the application is affected; if the service is the machine that runs several applications, then an outage of this service could affect several applications and their user groups).
3. Identify who will be notified of any outages for this service.
4. Identify the timeframe of the notification (e.g. immediate notification for system failures; advance notice for scheduled outages, which may vary depending on the reason for the outage, such as fixing an intermittent problem vs. maintenance).
5. Identify the standard methods for notification (e.g. e-mail, voicemail, systemstatus, console message, town crier, etc.). It’s important to note that more than one method of notification may be necessary or preferable.
   Consider the following when determining the best methods of notification:
   - What the user says is best for them
   - Size of the customer base
   - Location and distribution of the customer base
   - Number of notifications that need to be sent out (how many times to send out)
   - How far ahead of time the notices should be sent
   - Duration of the notification (how long it will last)
   - Sensitivity of the information
   - The most efficient and most likely way for people to receive this notice
6. For each method of notification, the group that is responsible for the non-availability of the baseline service, or that is most knowledgeable about the problem, is responsible for drafting the information to be used to notify the user community. This group is responsible for the following:
   A. Create the outage notification. Include information such as:
      - Explanation of the outage
      - Who the outage affects (list individuals, groups, users of applications, buildings, servers, etc.)
      - Who users should contact with questions or problems regarding the outage, and the responses to FAQs that will be given to User Support Services
      - Self-direction information (web page, system status) that allows users to find and access the information for themselves
      - Length/duration of the outage (when it starts/ends)
      - Expected results of the outage (what changes, what does not change, explanation of why the outage is occurring)
      - How users will be notified or informed when the outage is over, what the results are, and how they are affected
   B. Get buy-off on clarity of content from the department and a user support area such as Computing Support Services or the Service Desk.
   C. Distribute the notification to those identified in #3 above.
   D. Respond to questions, concerns, feedback, etc.
   E. Create a process to publish the results of the outage and provide feedback to system users (e.g. successful - problem fixed, upgrade done - new features include, or failed - here’s what we’re going to try next).
7. Identify how users will be trained about the notification process used for this service. Once we have established how we will notify users about an outage, we need to make sure they know how we will tell them. If we don’t set their expectations and send them email when they were expecting voice mail messages, we’ve failed - even though we’ve “notified” them.
8. Create a process for review and comment on the draft procedures by all parties involved in the notification process (e.g. users, Service Desk, Ops, other groups supporting the service). That is this process itself; review it at quarterly intervals to make sure we keep improving it.
9. Publish the procedures.
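The items listed under "Create the outage notification" in the checklist above can be captured as a small template so that no field is forgotten when a notice is drafted. The following is a sketch under stated assumptions: the class name OutageNotice, its field names, and the text layout produced by render are all hypothetical and are not an ITS standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutageNotice:
    """One field per item the checklist asks a notification to include."""
    explanation: str        # explanation of the outage
    affected: List[str]     # individuals, groups, applications, buildings, servers
    contact: str            # who users should contact with questions
    self_help: str          # web page or status page users can check themselves
    starts: str             # when the outage starts
    ends: str               # when the outage is expected to end
    expected_results: str   # what changes, what does not, and why
    follow_up: str          # how users will be told the outage is over

    def render(self) -> str:
        """Format the notice as plain text, one labeled line per item."""
        lines = [
            f"What: {self.explanation}",
            f"Who is affected: {', '.join(self.affected)}",
            f"Questions: {self.contact}",
            f"More information: {self.self_help}",
            f"When: {self.starts} to {self.ends}",
            f"Expected results: {self.expected_results}",
            f"Follow-up: {self.follow_up}",
        ]
        return "\n".join(lines)
```

The same rendered text could be reused across several notification methods (e-mail, the system status page, a voicemail script), which keeps the wording consistent once the content has been reviewed for clarity.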
Draft Table of Services and Owners
Service | Service Owner
Operating System Maintenance | Central Systems Administration
Patches to Op/sys | Central Systems Administration
Backups to Central Sys machines | Central Systems Administration
Central Systems server problems | Central Systems Administration
Mainframe maintenance/problems | Central Systems Administration
Printing server - Mainframe | Central Systems Administration
Email maintenance/problems | Central Systems Applications Mgmt
Calendar maintenance/problems | Central Systems Applications Mgmt
User maintenance problems (incl. SOAP) | Central Systems Applications Mgmt
Central Unix Web Server | Central Systems Applications Mgmt
Directory Server | Central Systems Applications Mgmt
Polycard Services (ID Cards, Diebold, PolyPrinting, Door Access) | PolyCard
Alumni systems maintenance/problems (ADS) | Advancement Systems
Business Application (FRS, IBS, HRS) | A.I.M./Bus. Apps
Student Applications (SIS, Capture, POWER) | A.I.M./Student Applications
Database Applications (Odin, Brio, Focus, TSO) | Application Systems/Data Technologies
Novell/NT Servers managed by PC/LAN | USS/PC/LAN
FWP/SWP Workstations | USS/PC/LAN
Labs | IMS/Labs
Mainframe Operations | USS/Operations
Xprtr reporting | USS/Operations
Teleview | USS/Operations
Modems, Imagine | USS/Service Desk
Training Class | USS/Training
Network | CCS/Network Admin
Telephone | CCS/Telecommunications Services
Web - Cal Poly, CourseInfo | IMS/Web Services
Media Distribution | IMS/MDS
Tools and Methods of User Notification
The following table outlines current methods and proposed methods for the future that may be used to notify our users. The method(s) used to notify our users should be a ‘best fit’ for the customer base and notification type. Note: In some cases you may want to limit the knowledge of the outage to those that ‘need to know’, for example, a security hole.
Current Methods
System Status Page
Pros: Information is available to everyone ASAP. Can be easily updated.
Cons: Users must access the web; if the web is down or they are not near a machine, they can't get the info.
Email
Pros: Users can be notified quickly. Distribution lists of users to notify can be created.
Cons: Users don’t read email. Slows down the email system. Can't be used if that’s the system that’s down. May need multiple emails sent for status updates.
VoiceMail
Pros: Can notify users when computers are down.
Cons: Users may not get the call. The VoiceMail system is impacted if sending to more than 50 users at a time.
Proposed Methods
Scrolling status bar on all ITS networked workstations (Labs)
Pros: Can push info out to users immediately. User notices the new message on screen. Scrolling continues until status changes.
Cons: Must be installed on each workstation. Must purchase s/w and licenses to use.
Electronic Billboards in prime areas such as labs, UU, Library, Admin depts
Pros: Notifies the general public as well as depts/contacts. Updated easily by multiple areas.
Cons: Costs money. Location specific. Won’t work if the mainframe or routers are down.
Standard practice and expectation for LAN Coords to redistribute info to their users
Pros: Reduces the number of customers for ITS to notify.
Cons: Need buy-in to this practice. Must have an accurate distribution list. Must have backup LAN Coords identified. Does not get information to the general public.
Scrolling marquee on Cal Poly home page and other main web sites
Pros: User has multiple areas on the web to see information. Prime locations for information can be advertised.
Cons: S/W required. Requires access by web browser. Information not available if the web site is down. Not appropriate for sensitive outages.
System filter that does not allow access to down or unavailable systems and points the user to pertinent information instead
Pros: User would know the problem and be given pertinent information.
Cons: Must be created for each application and system. May not be technically possible.
Modify system status pages so that general information is presented as easy-to-understand messages and explanations for the general public. Put technical information behind the visible screen or hide it (may require a password) so that it does not clutter the screen.
Pros: Easy to read for users. Keeps tech info off of the screen.
Cons: Must keep tech info separate.
Automated 24 hour/day phone messages with updated system status
Pros: Users only have to call to get status. Users help themselves.
Cons: Users must call. Messages must be updated.
Modify email clients to display a “news message” when users log into e-mail (has been done using Netscape Messenger)
Pros: User sees the news when they log into email.
Cons: Client code must be modified and maintained.
Methods that work and are being used in other areas
Broadcast messages
This is being done by using NOVELL servers to broadcast messages to users via the NOVELL server broadcast mechanism. Groups or types of users can be created and kept in files. For example, Administrative, Academic, Operations, and Network could all be types of users to whom we might want to broadcast messages. A broadcast message would be sent to all of the files or to any set of them (and so to the users listed in those files).
Since the majority of our users are on the NOVELL network, we would be able to get a message to them easily. LAN Coordinators may be responsible for keeping the ID files up to date.
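The idea of keeping user groups in files and broadcasting to any set of them can be sketched as follows. This only illustrates resolving the recipient list; it does not call the NOVELL broadcast mechanism, and the file format (one user ID per line, with blank lines and lines starting with "#" ignored) and the function names parse_group_file and recipients are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set


def parse_group_file(text: str) -> List[str]:
    """Return the user IDs in a group file (assumed format: one ID per line)."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines in the assumed format.
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def recipients(groups: Dict[str, List[str]], selected: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated users belonging to the selected groups."""
    seen: Set[str] = set()
    for name in selected:
        seen.update(groups[name])
    return sorted(seen)
```

De-duplicating matters because a user may appear in more than one group file (for example, both Administrative and Operations) and should receive a given broadcast only once.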
Web server notifications
In one area, two low-tech, low-cost PCs are being used as a NOVELL web server system. They back each other up, so there is redundancy and 100% up time.
Instead of putting the ITS status page on our ITS web server, we could put it on a dedicated web server system such as these two PCs. This would offer very simple management, low cost and upkeep, and 100% up time. Our system status page would not be affected by UNIX system maintenance, and the web server could link to system status pages on other machines.
Message of the Day
Netscape allows modification of its configuration files so that the email start-up page displays this message. This means that when users log into their email, they will receive the message of the day. Once the message has been viewed, it is not displayed again unless the user logs out of email and back in again.
This is a simple modification. The Netscape logo is also customizable, and the ITS/Cal Poly logo could easily be substituted. We could have this modification on every system on which PC/LAN installs Netscape.
Software downloads
Along with the NOVELL license comes Zenworks. Cal Poly is licensed for this software. This allows software to be pushed out to a user or a workstation. You can create ‘types’ of user files and push certain software out to the users included in the file.
