Datacenters 201
***************

Networking many racks
=====================

Power
=====

N+1, N+2 power
--------------

Fused vs usable
---------------

Calculations
------------

Cooling
=======

N+1, N+2
--------

The exhausts of redundant or load-sharing air-conditioning units should be
located far enough from each other that they do not reduce each other's
effectiveness.

Cooling efficiency
------------------

Surrounding temperature and load both influence how well an air-conditioning
unit can work. For example, a unit works best when it is not fully loaded
(does not have to move as many units of thermal heat as it is specified for)
and when the outside temperature is somewhere below 30 °C.
Effectiveness also drops if a unit is located too close to places where hot
air does not really move, e.g. at ground level, non-raised on a rooftop, or
on a roof section that is surrounded by higher parts of the building.
In a pinch, spraying water near the units with a sprinkler can help, since
the evaporating water carries heat away.
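
One way to reason about the outside-temperature effect, as a rough
theoretical bound rather than a vendor formula: the coefficient of
performance of any cooling cycle is limited by the Carnot value

.. math::

   \mathrm{COP}_{\max} = \frac{T_{cold}}{T_{hot} - T_{cold}}

with both temperatures in Kelvin, where :math:`T_{cold}` is the chilled-air
temperature and :math:`T_{hot}` the temperature at the condenser. Real units
reach only a fraction of this, but the shape holds: the hotter the outside
air (or the more stagnant the exhaust area), the bigger the temperature
difference the unit must pump across, and the less heat it moves per watt
of electricity.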

Cooling failures
----------------

When cooling fails, it is very important to find out how long the failure
has been going on. The surrounding datacenter structures delay a failure's
effect by absorbing heat, which means two things:

1. At some point the walls have warmed up close to the ambient temperature.
   From that moment on they absorb no more heat, and the rate at which the
   room heats up increases massively.
2. The delay cuts both ways: the room will take roughly as long to cool back
   down as it took to heat up, even if you have already shut down everything
   that consumes power.
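
Both effects fall out of treating the room air and the building structure as
two coupled heat reservoirs. A toy model of that (purely illustrative: every
constant below is invented, and the leak term stands in for heat escaping
through the building shell)::

   # Toy model: room air and structure as two lumped heat capacities.
   AIR_CAP = 7.0e5    # J/K  room air (assumed)
   WALL_CAP = 5.0e7   # J/K  walls, floor, racks (assumed)
   COUPLING = 2.0e3   # W/K  air <-> structure (assumed)
   LEAK = 5.0e2       # W/K  structure <-> outside (assumed)
   OUTSIDE = 15.0     # outside temperature, deg C
   IT_LOAD = 5.0e4    # W    server heat; cut to zero when we power off

   air = wall = 21.0  # deg C
   dt = 60.0          # one-minute steps
   for minute in range(48 * 60):
       load = IT_LOAD if minute < 6 * 60 else 0.0  # power off after 6 h
       soak = COUPLING * (air - wall)              # heat soaking into walls
       leak = LEAK * (wall - OUTSIDE)              # heat leaking outside
       air += (load - soak) * dt / AIR_CAP
       wall += (soak - leak) * dt / WALL_CAP
       if minute % 240 == 0:
           print(f"{minute / 60:4.0f} h  air {air:5.1f} C  wall {wall:5.1f} C")

The air first heats quickly until the walls start soaking up the load, then
follows the walls' much slower climb; after the power-off, the heat stored in
the structure keeps the room warm for many hours.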

Generally you need to assess a few things (a rough triage sketch follows
the list):

- Is the temperature currently still a tolerable operating temperature?
- How quickly is it rising?
- How soon will the fault be fixed, and when will a tolerable temperature
  be reached again?
- How long is the expected period of overheating?

- What damage is to be expected from a quick heat-up and cool-down?
  (For example, tape backups in progress at the time of failure will
  probably be lost because of stretching effects on the media, and disks
  can fail by the dozen at the same time.)

- Which absolutely critical systems are running, and how much power do
  they use?
- How many non-critical systems are running, and how much power would
  turning them off save?
- How fast can you shut them down? (On Unix, that could mean ``init 2``,
  wait, flip the switch.)
- Can you shut down enough to keep the most critical systems running?
- Can you also shut down enough to massively flatten the temperature curve?
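
A minimal triage sketch under hypothetical numbers (the system names and
power figures would come from your own monitoring or CMDB, and the linear
extrapolation is optimistic once the walls stop absorbing heat)::

   MAX_SAFE = 35.0       # deg C, tolerable inlet temperature (assumed)
   current = 29.5        # deg C, measured now
   rise_per_hour = 2.5   # deg C/h, measured over the last interval
   fix_eta_hours = 4.0   # h, vendor's estimate until cooling is back

   # (name, power draw in watts, absolutely critical?)
   systems = [
       ("db-primary",  1200.0, True),
       ("db-replica",  1200.0, False),
       ("batch-worker", 800.0, False),
       ("ci-runner",    600.0, False),
   ]

   # How long until we pass the tolerable temperature?
   hours_left = (MAX_SAFE - current) / rise_per_hour
   print(f"Tolerable for ~{hours_left:.1f} h, fix expected in {fix_eta_hours} h")

   # Every watt of IT load ends up as heat, so shedding non-critical
   # systems directly slows the heat-up.
   critical = sum(p for _, p, crit in systems if crit)
   sheddable = sum(p for _, p, crit in systems if not crit)
   print(f"Critical load {critical:.0f} W, sheddable {sheddable:.0f} W")

   if hours_left < fix_eta_hours:
       print("Will overheat before the fix: start shutting down non-critical"
             " systems now.")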

Generally, the goal is to flatten the effects so that the critical
infrastructure gets away with a slow and constant drift up and down of less
than 2-3 °C per hour.
The same applies once the cooling is fixed: run it at emergency power only
until the temperature curve has peaked, then decrease the cooling power
steadily so the ambient temperature takes a slow fall instead of going from
55 °C to 21 °C in 30 minutes.
(The dangers of a fast cool-down: condensing humidity, microscopic tears
on PCBs.)
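
As a worked example of that ramp budget: a fall from 55 °C back to 21 °C at
no more than 2 °C per hour has to be stretched over::

   (55 - 21) / 2 = 17 hours

so plan for the recovery to take most of a day rather than half an hour.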

Physical security and common security standards compliance requirements
========================================================================

Suggested practices
===================

Access control
--------------

Security should require an ID from anyone coming onsite, and not stop at
having you fill out a form on which you can write anything you want.

Techs should need to bring a ticket ID, and this ticket ID plus the tech's
name should be announced by the vendor to security, plus(!) a callback to a
predefined number known to security: a double handshake with a token.
This is really important to avoid incidents like two servers suddenly going
missing because "there were some techs that replaced them".
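
The logic of that double handshake, sketched as code; every function here is
a hypothetical stand-in for a manual step such as a phone call or a visitor
log, not a real API::

   def vendor_announced(ticket_id, tech_name):
       """Did the vendor announce this ticket and tech to us in advance?"""
       announced = {("TKT-4711", "J. Doe")}   # hypothetical announcements
       return (ticket_id, tech_name) in announced

   def callback_confirms(ticket_id):
       """Call the vendor back on the number *we* have on file, never one
       the tech hands us, and confirm the ticket is genuine."""
       return True                            # stands in for the phone call

   def admit(ticket_id, tech_name, photo_id_ok):
       if not photo_id_ok:
           return False                       # no ID, no entry
       if not vendor_announced(ticket_id, tech_name):
           return False                       # first handshake failed
       return callback_confirms(ticket_id)    # second handshake

   print(admit("TKT-4711", "J. Doe", photo_id_ok=True))    # True
   print(admit("TKT-9999", "Mallory", photo_id_ok=True))   # False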

Most critical datacenters will have security accompany you to the server.
Some will keep the security person around while the tech is working.
The really smart ones train their security staff so that they know *which*
server the tech needs to go to and *which* disk is to be replaced, as per
the original ticket. (If you think the security people can't handle that,
ask yourself who's to blame, since it does work for other datacenters.)