Datacenters 201
***************

Networking many racks
=====================

Power
=====

N+1, N+2 power
--------------

Fused vs usable
---------------

Calculations
------------

Cooling
=======

N+1, N+2
--------

The exhausts of redundant or load-sharing air-conditioning units should be
located far enough from each other that they do not reduce each other's
effectiveness.

Cooling efficiency
------------------

Surrounding temperature and load both influence how well an air-conditioning
unit can work. For example, a unit works best when it is not fully loaded
(does not have to move as many units of thermal heat as it is specified for)
and when the outside temperature is somewhere below 30 °C.
Effectiveness also drops if a unit is located too close to places where hot
air does not really move, e.g. at ground level, non-raised on a rooftop, or
on a roof section that is surrounded by higher parts of the building.
In a pinch, spraying water near the units with a sprinkler can help, since
the evaporating water carries heat away.
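
One way to reason about the outside-temperature effect, as a rough
theoretical bound rather than a vendor formula: the coefficient of
performance of any cooling cycle is limited by the Carnot value

.. math::

   \mathrm{COP}_{\max} = \frac{T_{cold}}{T_{hot} - T_{cold}}

with both temperatures in Kelvin, where :math:`T_{cold}` is the chilled-air
temperature and :math:`T_{hot}` the temperature at the condenser. Real units
reach only a fraction of this, but the shape holds: the hotter the outside
air (or the more stagnant the exhaust area), the bigger the temperature
difference the unit must pump across, and the less heat it moves per watt
of electricity.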

Cooling failures
----------------

When cooling fails, it is very important to find out how long the failure
has been going on. The surrounding datacenter structures delay a failure's
effect by absorbing heat, which means two things:

1. At some point the walls have warmed up close to the ambient temperature.
   From that moment on they absorb no more heat, and the rate at which the
   room heats up increases massively.
2. The delay cuts both ways: the room will take roughly as long to cool back
   down as it took to heat up, even if you have already shut down everything
   that consumes power.
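
Both effects fall out of treating the room air and the building structure as
two coupled heat reservoirs. A toy model of that (purely illustrative: every
constant below is invented, and the leak term stands in for heat escaping
through the building shell)::

   # Toy model: room air and structure as two lumped heat capacities.
   AIR_CAP = 7.0e5    # J/K  room air (assumed)
   WALL_CAP = 5.0e7   # J/K  walls, floor, racks (assumed)
   COUPLING = 2.0e3   # W/K  air <-> structure (assumed)
   LEAK = 5.0e2       # W/K  structure <-> outside (assumed)
   OUTSIDE = 15.0     # outside temperature, deg C
   IT_LOAD = 5.0e4    # W    server heat; cut to zero when we power off

   air = wall = 21.0  # deg C
   dt = 60.0          # one-minute steps
   for minute in range(48 * 60):
       load = IT_LOAD if minute < 6 * 60 else 0.0  # power off after 6 h
       soak = COUPLING * (air - wall)              # heat soaking into walls
       leak = LEAK * (wall - OUTSIDE)              # heat leaking outside
       air += (load - soak) * dt / AIR_CAP
       wall += (soak - leak) * dt / WALL_CAP
       if minute % 240 == 0:
           print(f"{minute / 60:4.0f} h  air {air:5.1f} C  wall {wall:5.1f} C")

The air first heats quickly until the walls start soaking up the load, then
follows the walls' much slower climb; after the power-off, the heat stored in
the structure keeps the room warm for many hours.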

Generally you need to assess a few things (a rough triage sketch follows
the list):

- Is the temperature currently still a tolerable operating temperature?
- How quickly is it rising?
- How soon will the fault be fixed, and when will a tolerable temperature
  be reached again?
- How long is the expected period of overheating?

- What damage is to be expected from a quick heat-up and cool-down?
  (For example, tape backups in progress at the time of failure will
  probably be lost because of stretching effects on the media, and disks
  can fail by the dozen at the same time.)

- Which absolutely critical systems are running, and how much power do
  they use?
- How many non-critical systems are running, and how much power would
  turning them off save?
- How fast can you shut them down? (On Unix, that could mean ``init 2``,
  wait, flip the switch.)
- Can you shut down enough to keep the most critical systems running?
- Can you also shut down enough to massively flatten the temperature curve?
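
A minimal triage sketch under hypothetical numbers (the system names and
power figures would come from your own monitoring or CMDB, and the linear
extrapolation is optimistic once the walls stop absorbing heat)::

   MAX_SAFE = 35.0       # deg C, tolerable inlet temperature (assumed)
   current = 29.5        # deg C, measured now
   rise_per_hour = 2.5   # deg C/h, measured over the last interval
   fix_eta_hours = 4.0   # h, vendor's estimate until cooling is back

   # (name, power draw in watts, absolutely critical?)
   systems = [
       ("db-primary",  1200.0, True),
       ("db-replica",  1200.0, False),
       ("batch-worker", 800.0, False),
       ("ci-runner",    600.0, False),
   ]

   # How long until we pass the tolerable temperature?
   hours_left = (MAX_SAFE - current) / rise_per_hour
   print(f"Tolerable for ~{hours_left:.1f} h, fix expected in {fix_eta_hours} h")

   # Every watt of IT load ends up as heat, so shedding non-critical
   # systems directly slows the heat-up.
   critical = sum(p for _, p, crit in systems if crit)
   sheddable = sum(p for _, p, crit in systems if not crit)
   print(f"Critical load {critical:.0f} W, sheddable {sheddable:.0f} W")

   if hours_left < fix_eta_hours:
       print("Will overheat before the fix: start shutting down non-critical"
             " systems now.")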

Generally, the goal is to flatten the effects so that the critical
infrastructure gets away with a slow and constant drift up and down of less
than 2-3 °C per hour.
The same applies once the cooling is fixed: run it at emergency power only
until the temperature curve has peaked, then decrease the cooling power
steadily so the ambient temperature takes a slow fall instead of going from
55 °C to 21 °C in 30 minutes.
(The dangers of a fast cool-down: condensing humidity, microscopic tears
on PCBs.)
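
As a worked example of that ramp budget: a fall from 55 °C back to 21 °C at
no more than 2 °C per hour has to be stretched over::

   (55 - 21) / 2 = 17 hours

so plan for the recovery to take most of a day rather than half an hour.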

Physical security and common security standards compliance requirements
========================================================================

Suggested practices
===================

Access control
--------------

Security should require an ID from anyone coming onsite, and not stop at
having you fill out a form on which you can write anything you want.

Techs should need to bring a ticket ID, and this ticket ID plus the tech's
name should be announced by the vendor to security, plus(!) a callback to a
predefined number known to security: a double handshake with a token.
This is really important to avoid incidents like two servers suddenly going
missing because "there were some techs that replaced them".
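
The logic of that double handshake, sketched as code; every function here is
a hypothetical stand-in for a manual step such as a phone call or a visitor
log, not a real API::

   def vendor_announced(ticket_id, tech_name):
       """Did the vendor announce this ticket and tech to us in advance?"""
       announced = {("TKT-4711", "J. Doe")}   # hypothetical announcements
       return (ticket_id, tech_name) in announced

   def callback_confirms(ticket_id):
       """Call the vendor back on the number *we* have on file, never one
       the tech hands us, and confirm the ticket is genuine."""
       return True                            # stands in for the phone call

   def admit(ticket_id, tech_name, photo_id_ok):
       if not photo_id_ok:
           return False                       # no ID, no entry
       if not vendor_announced(ticket_id, tech_name):
           return False                       # first handshake failed
       return callback_confirms(ticket_id)    # second handshake

   print(admit("TKT-4711", "J. Doe", photo_id_ok=True))    # True
   print(admit("TKT-9999", "Mallory", photo_id_ok=True))   # False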

Most critical datacenters will have security accompany you to the server.
Some will keep the security person around while the tech is working.
The really smart ones train their security staff so that they know *which*
server the tech needs to go to and *which* disk is to be replaced, as per
the original ticket. (If you think the security people can't handle that,
ask yourself who's to blame, since it does work for other datacenters.)