- To: slug@xxxxxxxxxxx
- Subject: Re: [SLUG] Virtualisation and DRBD
- From: Daniel Pittman <daniel@xxxxxxxxxxxx>
- Date: Wed, 25 Aug 2010 22:36:11 +1000
- Reply-to: slug@xxxxxxxxxxx
- User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Nigel Allen <dna@xxxxxxxxxxx> writes:
> We're investigating both virtualisation of servers and High Availability at
> the same time. Currently looking at Linux-HA and DRBD (amongst others).
Keep in mind that both things, HA and virtualization, are actually pretty
hard to get working smoothly. Just ask me how much it sucks when things don't
work right, after another ten-hour day fighting a data corruption bug. ;)
> The idea of DRBD appeals to both me and the client as it means (or seems to
> at least) that we could add a third (off-site) machine into the equation for
> "real" DR.
DRBD is a two-machine design in pretty much any reasonable setup, and the
performance for an off-site spare is going to be ... interesting. If you
don't have a very low-latency, high-bandwidth link between the locations then
you can expect significant pain and suffering.
(The safe DRBD protocol means suffering in performance; the unsafe ones
insulate you from that pain until you depend on the warm spare and find out
whether it was already corrupted before the failure.)
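
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch. DRBD calls its fully synchronous mode protocol C and its asynchronous
mode protocol A; the function names and every figure below are illustrative
assumptions, not measurements of DRBD or of any real link.

```python
# Rough model only: all latencies are made-up example values.

def sync_write_latency_ms(local_disk_ms, remote_disk_ms, rtt_ms):
    """Synchronous replication: a write completes only once both the local
    and the remote disk have it. The two legs run in parallel, and the
    remote leg costs a network round trip plus the remote disk write."""
    return max(local_disk_ms, rtt_ms + remote_disk_ms)


def async_write_latency_ms(local_disk_ms):
    """Asynchronous replication: a write completes once it is on the local
    disk; the remote copy is allowed to lag behind."""
    return local_disk_ms


def max_serial_writes_per_sec(latency_ms):
    """Upper bound for a single writer that waits for each write to finish."""
    return 1000.0 / latency_ms


# Same disks, different links: a LAN peer versus an off-site peer.
lan_ms = sync_write_latency_ms(local_disk_ms=5.0, remote_disk_ms=5.0, rtt_ms=0.2)
wan_ms = sync_write_latency_ms(local_disk_ms=5.0, remote_disk_ms=5.0, rtt_ms=40.0)
async_ms = async_write_latency_ms(local_disk_ms=5.0)

print(f"sync over LAN: {lan_ms:.1f} ms, {max_serial_writes_per_sec(lan_ms):.0f} writes/s")
print(f"sync over WAN: {wan_ms:.1f} ms, {max_serial_writes_per_sec(wan_ms):.0f} writes/s")
print(f"async:         {async_ms:.1f} ms, {max_serial_writes_per_sec(async_ms):.0f} writes/s")
```

In this model the asynchronous mode looks as fast as local disk, which is
exactly the insulation described above: the cost only shows up as data the
remote side never received.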
> What happens when we then introduce Virtualisation into the equation
> (currently have 4 x servers running Centos & Windoze Server - looking at
> virtualising them onto one single box running Centos-5).
Keep in mind that performance absolutely requires that you have
paravirtualized drivers for those kernels. That means picking something where
you have good disk and network virtual drivers ... and that probably means
"not KVM", which sucks.
> I suppose the (first) question is: If we run 4 virtualised servers (B,
> C, D, and E) on our working server "A" (complete with it's own storage),
> can we also use DRBD to sync the entire box and dice onto server A1
> (containing servers B1, C1, D1, and E1) or do we have to sync them all
Yes. Specifically: you can do either, or both, depending on how you set it
up, and on the capabilities of whatever management software you layer over the
basic virtualization tools.
> Will this idea even float? Can we achieve seamless failover with this.
Maybe. You have to be very, very clear on which two of the three attributes
(consistency, availability, and partition tolerance) you need, and make
absolutely certain that you deliver on them.
To be clear: that means you must absolutely deliver availability (of your HA
solution) and non-partitioned connectivity, because you can't live with
inconsistency of data between the machines.
This is much harder than it sounds: a single pulled network cable can
partition the pair, and the entire ball of wax falls apart if you are not
very, very careful.
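
One common way to reason about the pulled-cable case is a strict majority
vote. This is a minimal sketch of that rule only; the function name and the
idea of a third "witness" vote are assumptions for illustration, not a
description of what Linux-HA or DRBD actually do for you.

```python
def may_stay_active(votes_seen, total_votes):
    """Strict majority rule: a node keeps serving only if it can see more
    than half of all configured votes, counting its own."""
    return votes_seen * 2 > total_votes


# Two nodes, cable pulled: each side sees only itself (1 of 2 votes).
# Neither side can tell "my peer died" from "the link died", so neither
# has a majority and both must stop.
two_node_a = may_stay_active(votes_seen=1, total_votes=2)
two_node_b = may_stay_active(votes_seen=1, total_votes=2)

# Two nodes plus a hypothetical third witness vote: the side that can still
# reach the witness has 2 of 3 votes, the isolated side has 1 of 3.
with_witness = may_stay_active(votes_seen=2, total_votes=3)
isolated = may_stay_active(votes_seen=1, total_votes=3)

print(two_node_a, two_node_b, with_witness, isolated)
```

The point of the sketch is that a plain two-machine pair cannot keep both
consistency and availability through a partition on votes alone; something
extra (a tiebreaker, or fencing the other node) has to break the tie.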
> If not, how would you do it.
We just moved from this to delivering iSCSI storage in the back-end, with
execute nodes that are going to start shedding local disk. This uses KVM as
the virtualization technology, but anything that can talk to raw partitions
should be fine on top of this.
This gives us two advantages: first, we can scale as broadly as we want (and
have great performance) by virtue of deploying cheap storage nodes with 6TB of
usable disk, tolerance of any two disk failures, and 8.5GB of disk cache
between the nodes and the spinning rust.
Adding another of those and moving some of the load to another system is
relatively inexpensive, and can fairly easily grow with our needs; we can also
scale up storage node performance by throwing a more expensive SAS array or
server into the mix. (...or just more cache memory, at $100 a GB or so.)
The second is that we can deliver reliability without complicating the storage
nodes: a virtual machine can use Linux software RAID to mirror, stripe, or
otherwise combine multiple iSCSI LUNs inside the virtual machine.
This gives us similar performance and redundancy benefits to using local
storage to back those devices, including the ability to lose a storage node
and have the machine continue to work.
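
As a toy illustration of why the machine keeps working when a storage node
goes away, here is a mirror modelled in a few lines. The class names and the
dict-backed "targets" are invented for the example; this is not how Linux
software RAID is implemented, only the behaviour it gives you.

```python
class ToyTarget:
    """Stands in for one iSCSI LUN: a dict of block number -> bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True


class ToyMirror:
    """RAID-1 style mirror: every write goes to all online targets, and
    reads are served from the first online target."""

    def __init__(self, targets):
        self.targets = list(targets)

    def _online(self):
        live = [t for t in self.targets if t.online]
        if not live:
            raise IOError("all mirror legs are offline")
        return live

    def write(self, block_no, data):
        for target in self._online():
            target.blocks[block_no] = data

    def read(self, block_no):
        return self._online()[0].blocks[block_no]


node_a = ToyTarget("storage-a")
node_b = ToyTarget("storage-b")
mirror = ToyMirror([node_a, node_b])

mirror.write(0, b"hello")      # lands on both storage nodes
node_a.online = False          # lose one storage node
survived = mirror.read(0)      # still readable from the other leg
mirror.write(1, b"world")      # writes continue on the surviving leg
print(survived, sorted(node_b.blocks))
```

Note that the failed leg misses the later write, so it needs a resync
before it can be trusted again when it comes back.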
This does mean that, unlike a design where replication is directly between
storage nodes, we send writes out N times for N targets from the processing
node — but that isn't any more bytes over the network overall, and we are not
short on network capacity. (Plus, dedicated storage links on a bunch of the
busier machines make this work without interference with public services.)
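
The byte counting behind that claim can be sketched as follows. The two
function names and the 1 MB write are assumptions for the example; the model
ignores protocol overhead and just counts payload copies.

```python
def storage_side_replication(write_bytes, copies):
    """The compute node sends one copy; the storage node that receives it
    forwards the remaining copies to its peers."""
    compute_nic = write_bytes
    storage_to_storage = write_bytes * (copies - 1)
    return {
        "compute_nic": compute_nic,
        "network_total": compute_nic + storage_to_storage,
    }


def client_side_mirroring(write_bytes, copies):
    """The compute node sends every copy itself, one per target."""
    compute_nic = write_bytes * copies
    return {"compute_nic": compute_nic, "network_total": compute_nic}


one_mb = 1024 * 1024
replicated = storage_side_replication(one_mb, copies=2)
mirrored = client_side_mirroring(one_mb, copies=2)

print("storage-side:", replicated)
print("client-side: ", mirrored)
```

The network total is the same either way; what changes is that the compute
node's own link carries N copies instead of one, which is where dedicated
storage links help.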
✣ Daniel Pittman ✉ daniel@xxxxxxxxxxxx ☎ +61 401 155 707
♽ made with 100 percent post-consumer electrons