Monitoring and issue escalation

1. Hardware monitoring

Contigex (our hosting provider) provides continuous monitoring for the servers. The monitoring generates internal alerts that go directly to them; we should never need to be looped in.

2. Site monitoring

Contigex and UC Merced IT independently monitor this representative list of sites:

admissions.ucmerced.edu
cattracks.ucmerced.edu
engineering.ucmerced.edu
financialaid.ucmerced.edu
graduatedivision.ucmerced.edu
housing.ucmerced.edu
hr.ucmerced.edu
it.ucmerced.edu
library.ucmerced.edu
naturalsciences.ucmerced.edu
news.ucmerced.edu
registrar.ucmerced.edu
ssha.ucmerced.edu
www.ucmerced.edu

These sites were chosen because they're high-traffic and mission-critical. It's extremely rare for an outage to affect just one site (because almost all code is shared across sites), so this representative list is sufficient to tell us that something is wrong with our Drupal installation.
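As a rough illustration of what a check over a representative list looks like, here is a minimal Python sketch. It is not the tooling Contigex or UC Merced IT actually runs. The function names (`http_probe`, `failing_sites`, `classify`), the HTTPS front-page check, and the "one failure versus several" rule are all assumptions made for the example.

```python
from typing import Callable, Iterable, List
from urllib.error import URLError
from urllib.request import urlopen

# A few hosts from the monitored list above; extend as needed.
MONITORED_SITES = [
    "admissions.ucmerced.edu",
    "library.ucmerced.edu",
    "www.ucmerced.edu",
]


def http_probe(host: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers over HTTPS without an error status.

    Assumption: HTTPS on the default port, with the front page as the
    check target. The real monitors may check something else.
    """
    try:
        with urlopen(f"https://{host}/", timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError):
        # HTTPError (4xx/5xx), DNS failures, and timeouts all land here.
        return False


def failing_sites(hosts: Iterable[str],
                  probe: Callable[[str], bool] = http_probe) -> List[str]:
    """Return the hosts whose probe failed, in the order given."""
    return [host for host in hosts if not probe(host)]


def classify(hosts: Iterable[str],
             probe: Callable[[str], bool] = http_probe) -> str:
    """Label the result as 'ok', 'single-site', or 'shared'.

    Illustrative rule only (not from the monitoring setup itself):
    one failure among several hosts suggests a site-specific problem,
    while more than one suggests shared code or infrastructure.
    """
    hosts = list(hosts)
    down = failing_sites(hosts, probe)
    if not down:
        return "ok"
    return "single-site" if len(down) == 1 and len(hosts) > 1 else "shared"
```

Passing the probe in as a parameter keeps the logic testable without touching the network; calling `classify(MONITORED_SITES)` with the default probe performs real HTTPS requests.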

Contigex alerts are automatically turned into internal tickets and addressed by their sysadmins (I also get them).

UC Merced alerts go to the following people:

Joseph Garcia
Joshua Andrade
Steven Powers
Bryan Green (IT)

3. Issue escalation

Contigex provides 24/7 support. The Service Level Agreement lists issue resolution times (short version: critical outages should be addressed within half an hour). Most issues should be handled automatically without looping us in. In general, if a production issue requires our attention, that indicates a problem with our processes that we need to correct.

If Contigex is unable to fix an issue internally, they will escalate it to us (right now, "us" is just me; we should probably expand that list).

Independently, if we have outage notifications that aren't immediately resolved, the call order is:

Joseph Garcia
Joshua Andrade
Steven Powers
Bryan Green (IT)
Everyone else with Contigex access (approximately a dozen people; look under "Users")

That is, call the first person; if you can't reach them, call the second person, and so on. In practice, the people on this list will mostly provide Contigex with information about what recently changed, to assist with troubleshooting. Bryan and I also have admin access to the servers, so we can perform more extensive investigations.
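The call order amounts to a simple sequential fallback. A minimal Python sketch, assuming a hypothetical `try_call` callback that returns True when the person picks up (we have no automated paging; this only illustrates the logic):

```python
from typing import Callable, Optional, Sequence

# Named contacts from the call order above. The final entry ("everyone
# else with Contigex access") is a group, so it is left out of this sketch.
CALL_ORDER = [
    "Joseph Garcia",
    "Joshua Andrade",
    "Steven Powers",
    "Bryan Green (IT)",
]


def escalate(call_order: Sequence[str],
             try_call: Callable[[str], bool]) -> Optional[str]:
    """Call each person in order and return the first one reached.

    `try_call` is a hypothetical callback standing in for an actual
    phone call. Returns None if nobody on the list could be reached.
    """
    for person in call_order:
        if try_call(person):
            return person
    return None
```

For example, if only Steven Powers answers, `escalate(CALL_ORDER, lambda p: p == "Steven Powers")` tries Joseph Garcia and Joshua Andrade first, then returns `"Steven Powers"`.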

Kent Carpenter is the IT Services Manager and serves as our official point of contact. However, IT has no direct responsibility for the websites. Kent maintains an internal ServiceNow page for the IT staff that documents everything here, so that the IT help desk knows who to contact in an emergency.

4. Upgrades and planned outages

We regularly update our systems. Roughly speaking, the updates fall into four buckets:

Normal server security updates (no reboot): Contigex will do these automatically during their maintenance window (12 am - 5 am).
Exceptional server security updates (reboot required): Contigex will coordinate with us to pick an upgrade time/date.
Drupal core updates: We will request a time/date for Contigex to install these (typically during a weekend maintenance window, unless it's an emergency update).
Drupal module/theme/feature updates: We install these ourselves during an evening or weekend maintenance window.

Our production environment contains two parallel webserver stacks. If an upgrade requires an outage or reboot, Contigex will upgrade one stack at a time so that the websites remain available.
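The one-stack-at-a-time approach is a rolling upgrade. The Python sketch below shows the general idea only; it does not describe Contigex's actual tooling. The stack names and the `upgrade` and `is_healthy` callbacks are hypothetical.

```python
from typing import Callable, List, Sequence


def rolling_upgrade(stacks: Sequence[str],
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade stacks one at a time so at least one keeps serving.

    `upgrade` and `is_healthy` are hypothetical callbacks. If a stack
    fails its health check, the process stops before touching the
    remaining stacks, which stay in service.
    """
    if len(stacks) < 2:
        raise ValueError("need at least two stacks to stay available")

    in_service = set(stacks)
    upgraded: List[str] = []
    for stack in stacks:
        in_service.discard(stack)      # take this stack out of rotation
        upgrade(stack)
        if not is_healthy(stack):
            raise RuntimeError(
                f"{stack} failed its health check; "
                f"still serving: {sorted(in_service)}"
            )
        in_service.add(stack)          # return it to rotation
        upgraded.append(stack)
    return upgraded
```

Stopping at the first failed health check is the key design choice: the untouched stack continues to serve traffic while the problem is investigated.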

We perform all code (module/theme/feature) upgrades in our development environment first and test them before deploying them to production. Development is a close mirror of production, so this gives us confidence that the updates will work in production.

I don't communicate any of this to the campus community because the campus community doesn't want to get 3-10 emails a week from me. If there are particular sites or stakeholders that are likely to be affected by a specific change, I may communicate with them in a targeted way.
