Bootstrapping OpenStack Clouds
Platforms and Infrastructure for Hyperscale Environments
A Dell Technical White Paper
Authored by Rob Hirschfeld and Greg Althaus
Contributions from Bret Piatt, Director of Product Management, Rackspace Cloud Builders

Bring open APIs and best practices to cloud operations.
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

Table of Contents
Executive Summary
Selecting a Platform
Fundamental Hyperscale Design Patterns
  Fault Zones
  Flatness at the Edges
Choosing Hardware
Network Configuration
  Design Guidelines
Operations Infrastructure
  The Administration Server
  Core Services
  Provisioning
  Monitoring
Beyond Bootstrapping: Laying Down OpenStack
  Deploying Storage (Swift)
  Deploying Compute (Nova)
  Other Services
Key Takeaways
To Learn More
Executive Summary

Bringing a cloud infrastructure online can be a daunting bootstrapping challenge. Before hanging out a shingle as a private or public service provider, you must select a platform, acquire hardware, configure your network, set up operations services, and integrate it all together. Those are a lot of moving parts before you have even installed a sellable application.

This white paper walks you through the decision process to get started with an open source cloud infrastructure based on OpenStack and Dell PowerEdge C class hardware. The figure below serves as a roadmap for components that we'll cover: red for OpenStack, blue for hardware, green for operations and configuration, and white for topics reserved for future white papers from Dell. At the end, you'll be ready to design your own trial system that will serve as the foundation of your hyperscale cloud.

Sidebar: Getting Started. If you want a serious leg up toward a working cloud, Dell is building a getting-started kit around the principles discussed in this white paper.

Selecting a Platform

This white paper assumes that you've selected the OpenStack Bexar release on Ubuntu 10.10 as your infrastructure platform. While the concepts hold for any hyperscale cloud infrastructure, it's helpful to focus on a single platform for this reference.
OpenStack is particularly interesting as an open source cloud because it:

- Supports the two top public compute cloud application programming interfaces (APIs): Amazon and Rackspace
- Supports the two top open source hypervisors (KVM and Xen)
- Can run guests using Windows, Linux, or other x86-based operating systems
- Will be deployed at hyperscale (1,000+ nodes) at multiple sites (NASA, Rackspace, and others)
- Is truly open and community-developed, allowing users to fix, support, and extend features as needed
- Has a significant international community adding new features

OpenStack represents an innovator's paradise: it offers support for existing ecosystems and opportunities to influence future direction, and it provides the foundational components for a cloud service. By building on this foundation, you can create a complete cloud solution. We will discuss added services and extensions at the end of this paper.

Sidebar: Getting Started (continued). This kit includes a base hardware specification and tools that take you from unboxing servers to running a usable OpenStack cloud in hours. Email us at [email protected] if you are interested in learning more.
Interested users are now reaching out to Dell for help in test driving OpenStack as an open source cloud. Dell's OpenStack getting-started kit specifically targets trials by reducing setup time and lessening the learning curve to configure a base OpenStack cloud.

There are three primary components of OpenStack: Compute (Nova), Object Storage (Swift), and an Image Service (Glance). Our focus is on preparing an environment to run OpenStack. You will need additional references to learn everything you need to manually complete an OpenStack install.

Note: Dell is actively seeking customers interested in conducting a proof-of-concept (PoC) using OpenStack. A PoC engagement with Dell will involve many of the configurations and services discussed in this white paper. You may also wish to take advantage of services available from Rackspace Cloud Builders. Email us at [email protected] if you are interested in learning more.

Fundamental Hyperscale Design Patterns

Fault Zones

Building a hyperscale cloud requires a different mindset (we like to call it "revolutionary") compared to a traditional enterprise virtualized infrastructure. This means driving a degree of simplicity, homogeneity, and density that is beyond most enterprise systems.

The core lesson of these large systems is that redundancy moves from the hardware into the software and applications. In fact, the expectation of failure is built into the system as a key assumption, because daily failures are a fact of life when you have thousands of servers.

To achieve scale, individual components intentionally lack network, power, and disk redundancy. Servers are configured with single network paths, single power supplies, and non-RAIDed drives (aka JBOD, or "just a bunch of disks"). That means that a power distribution unit (PDU) or rack switch failure will take down a handful of servers.
To accommodate this risk, the system is divided into what we call "fault zones." Applications and data are striped across fault zones (similar to data striping on a RAID) to isolate the impact of multiple component failures.

Sidebar: What is a "hyperscale cloud"? Hyperscale systems are designed to operate thousands of servers under a single management infrastructure. The scope of these systems requires a different management paradigm in which hardware faults are common, manual steps are not practical, and small costs add up to large economic impacts. An example of small costs adding up to big impacts: changing a six-drive array from RAID 5 to RAID 10 would reduce total storage by 40 percent. Put another way, you'd have to buy 66 percent more disk (10 drives instead of 6) for the same total storage!

The benefits of this design approach are significant:
- The ability to choose non-redundant components (disk, server, and network) with a lower total cost of ownership (TCO)
- Simpler network routing and configuration
- Simpler physical data center layouts
- Higher density, because capacity is not lost to redundant disk, network, and power
- Predictable and streamlined setups and deployment processes

It is important to point out that core networking is still constructed with redundant and hardware-fault-tolerant paths.

As a consumer of this infrastructure approach, applications must take a fault-zone-tolerant deployment model. We have discussed this in detail in blog posts and presentations about application striping using redundant arrays of inexpensive nodes (RAIN).

Flatness at the Edges

"Flatness at the edges" is one of the guiding principles of hyperscale cloud designs. Flatness means that cloud infrastructure avoids creating tiers where possible. For example, having a blade in a frame aggregating networking that is connected to a SAN via a VLAN is a tiered design in which the components are vertically coupled. A single node with local disk connected directly to the switch has all the same components but in a single "flat" layer. Edges are the bottom tier (or "leaves") of the cloud. Being flat creates a lot of edges because most of the components are self-contained. To scale and reduce complexity, clouds must rely on the edges to make independent decisions, such as how to route network traffic, where to replicate data, or when to throttle virtual machines (VMs).
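One of those edge decisions, where to replicate data across fault zones, can be sketched in a few lines. This is an illustrative toy only: the zone structure and the `place_replicas()` helper are our own invention, not OpenStack code, but it shows the RAIN-style striping idea: each copy lands in a different fault zone, so a PDU or rack switch failure costs at most one replica.

```python
# Toy sketch of RAIN-style replica placement across fault zones.
# Illustrative only -- not OpenStack code.

def place_replicas(fault_zones, replica_count):
    """Pick one node from each of `replica_count` distinct fault zones."""
    if replica_count > len(fault_zones):
        raise ValueError("need at least one fault zone per replica")
    placements = []
    # Prefer the least-loaded zones so placement also levels utilization.
    for zone in sorted(fault_zones, key=lambda z: len(z["used"]))[:replica_count]:
        node = zone["nodes"][len(zone["used"]) % len(zone["nodes"])]
        zone["used"].append(node)
        placements.append(node)
    return placements

racks = [
    {"name": "rack1", "nodes": ["r1n1", "r1n2"], "used": []},
    {"name": "rack2", "nodes": ["r2n1", "r2n2"], "used": []},
    {"name": "rack3", "nodes": ["r3n1", "r3n2"], "used": []},
]
print(place_replicas(racks, 3))   # -> ['r1n1', 'r2n1', 'r3n1']
```

Because every replica sits in a distinct zone, losing any single rack leaves two copies intact, which is exactly the property the software layer relies on once hardware redundancy is removed.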
We are effectively distributing an intelligence overhead tax on each component of the cloud rather than relying on a "centralized overcloud" to rule them all.

Note: An anti-example of edge design is using VLANs to segment tenants, because VLANs (a limited resource) require configuration at the switching tier to manage traffic generated by an edge component.

Choosing Hardware

Choosing cloud hardware requires committing to a fault-tolerance strategy that matches your operations model. For hyperscale clouds, our customers demand highly modular and dense solutions. Just as a RAID system focuses on using interchangeable commodity disks, clouds are built using interchangeable utility servers. The logic is that you will have sufficient scale to create redundancy and, more importantly, sufficient modularity to grow incrementally.

Modularity is a critical value to help reduce complexity. When clouds are measured in the hundreds of nodes, it is difficult to manage nodes that are linked in groups of six to a dedicated SAN and then connected with eight or more pairs of teamed network interface controllers (NICs) to different cross-connected switches. If just describing the system is difficult, then imagine trying to design, document, and maintain it.

Fundamentally, hyperscale clouds have less shared physical infrastructure by design, because shared physical infrastructure is harder to configure, manage, and troubleshoot. It also has the unfortunate side effect of causing broader systems outages.
While the individual components may be more likely to fail in this model, the impact of those failures is more isolated, smaller, and much easier to correct quickly.

Sidebar: Concepts like "Flatness at the Edges" are based on operating hyperscale clouds. In many cases, hyperscale design requirements are contrary to traditional data center objectives because they have different core assumptions. The Dell Data Center Solutions (DCS) group has been helping customers build clouds at this scale for years. The innovations from these hyperscale data centers have begun to trickle down and can now be successfully applied at a moderate scale.

In our experience, nodes fall into one of four performance categories:
- Compute solutions are not as common for virtual machine-based clouds but are typical for some analytics systems (interestingly, many analytics workloads are more disk- and network-bound). In practice, cloud applications are more likely to scale out than up.
- Storage solutions should be treated with caution. Use IP-network-based iSCSI SAN or NAS storage to address these cases, because it's much easier to centralize big data than drag all of it to your local nodes. Note: If you have a solution that needs really big storage and lots of VMs, then it may not be a good cloud application.
- Network solutions may really be compute-heavy systems in disguise. Unless you are packing a lot of RAM and CPU into your systems, it's unlikely that you will hit the wall on networking bandwidth (more about this later). Remote storage is a primary driver for needing more networking capacity, so you may solve your networking constraints by using more local disk.
- Balanced solutions are a good compromise, because even the most basic VM placement can distribute VMs to level resource use. This is likely to become even easier when live migration is a standard feature (expected before the OpenStack Cactus release).

Sidebar: "To RAID or not to RAID, that is the question." Using hardware RAID on compute nodes can provide an additional safety net for customer data. This is important when you do not expect (or force) customers to scale on multiple nodes or use network storage for critical data.

Comparing these four categories to available Dell PowerEdge C server models, the balanced focus server seems to handle the broadest range of applications for compute, while the storage node is the best choice for storage.
We recommend the balanced node for trial systems because it can be easily repurposed anywhere else as your cloud grows.

Focus      Dell Model
Compute    PowerEdge C6100, 4 sleds (pictured above)
Balanced   PowerEdge C6100, 2 sleds
Storage    PowerEdge C2100
Network    PowerEdge C2100 with 10 Gb NICs

Assumptions:
- 48 gigabytes (GB) of RAM per node (actual RAM can be higher or lower).
- Six 2.5-inch drives boost spindle counts; 3.5-inch drives offer more capacity and less cost but lower IOPS (input/output operations per second).
- Disk/core ratios assume unRAIDed drives for comparison. Counts decrease if RAID systems are used.

Sidebar: To RAID or not to RAID (continued). The downside of RAID is that it reduces storage capacity while adding cost and complexity. RAID may also underperform JBOD configurations if VM I/O is not uniform. Ultimately, your Ops capability and risk tolerance determine if RAID is the right fit.
- Four NICs per node, as per guidance in the "Network Configuration" section of this paper.

The key to selecting hardware is to determine your target ratios. For example, if you are planning compute to have one core per VM (a conservative estimate), then a balanced system would net nearly one spindle per VM. That effectively creates a dedicated I/O channel for each VM and gives plenty of storage. While you may target higher densities, it's useful to understand that your one-core class of VMs has nearly dedicated resources. Flatness at the edges encourages this type of isolation at the VM level because it eliminates interdependencies at the maximum possible granularity.

When you look at storage hardware, it can be difficult to find a high enough disk-to-core ratio. For solutions like Swift, you may want to consider the most power-efficient CPUs and largest disks. Object stores are often fronted with a cache so that high-demand files do not hit the actual storage nodes.

So let's look at the concept of a mixed storage and compute system. In that model, the same nodes perform both compute and storage functions. For that configuration, the network-optimized node seems to be the best compromise; however, we consistently find that a mixed-use node has too many compromises and ends up being more expensive than a heterogeneous system, since 10-gigabit (Gb) networking still carries a hefty premium. There is one exception: we recommend a mixed-use system for small-scale pilots because it gives you the most flexibility while you are learning to use your cloud infrastructure.

As with any design, the challenge is to prevent exceptions from forcing suboptimal design changes. For example, the need to host some 100-Gb disk VMs should not force the entire infrastructure into a storage-heavy pattern. It is likely a better design to assume 20-Gb VMs on fast local disk and set up a single shared iSCSI SAN or NAS target to handle the exceptions as secondary drives.
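The ratio arithmetic above is worth making concrete. The node counts below are assumptions for illustration, not a Dell specification; the point is the method of checking VM-to-spindle ratios (and the RAID capacity tradeoff from the sidebar) before committing to hardware.

```python
# Back-of-the-envelope ratio check. Node counts are assumptions
# for illustration, not a Dell specification.

cores_per_node = 12        # e.g., a dual-socket, hex-core node
spindles_per_node = 12     # unRAIDed 2.5-inch local drives
vms_per_core = 1           # the conservative planning figure above

vms_per_node = cores_per_node * vms_per_core
print(spindles_per_node / vms_per_node)   # 1.0: a near-dedicated I/O channel per VM

# The RAID sidebar's math follows the same pattern: six drives in
# RAID 10 keep 3 drives of capacity versus 5 drives for RAID 5.
raid5_usable = 6 - 1       # one drive's worth of parity
raid10_usable = 6 // 2     # mirrored pairs
print(1 - raid10_usable / raid5_usable)   # 0.4 -> 40 percent less usable storage
print(10 / 6 - 1)          # ~0.67 -> about 66 percent more drives for the same capacity
```

Running the same three lines against your actual quotes is a quick way to catch a configuration that silently drops below one spindle per VM.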
For service providers, these exceptions become premium features.

Network Configuration

It is virtually impossible to overstate the importance of networking for hyperscale clouds, but importance should not translate into complexity. The key to cloud networking is to simplify and flatten. Achieving this design objective requires making choices that are contrary to enterprise network topologies.

Sidebar: Hyperscale clouds are fundamentally multitenant. The ability to mix unrelated work together enables large-scale cloud load balancing. The expectation of dynamic demand pairs with the feature of resource elasticity. Our multitenant assumption creates a requirement paradox: we need both isolation and aggressive intermixing. It should be no surprise that the answer is virtualization of compute, network, and storage.

A Typical Topology

Best practice for hyperscale clouds calls for three logical primary networks with a possible fourth. In practice, these networks are usually mapped directly to physical NICs; however, that mapping is not required.

1. The administration network connects the cloud infrastructure management to the nodes that run the cloud workloads. This network is restricted and not accessible to VMs. See the "Operations Infrastructure" section for more information.
2. The internal network provides connectivity between VMs and services (e.g., the object store) within the cloud. This network typically carries the bulk of the cloud traffic, and customers are not charged for bandwidth consumed internally.
3. The external network connects VMs to the Internet and is metered so use can be charged to the customer.
4. Use of a storage network is recommended when using centralized storage, to isolate the impact of large transfers on other networks. If storage traffic is isolated from the VMs, then it may be possible to combine storage with the administration network.

There are several reasons for segmenting the networks, but the primary one is bandwidth distribution. We want to ensure that traffic on our money network (external) is not disrupted by activity on the other networks. Segmentation also allows for better IP management. Surprisingly, security is not a motivation for segmentation. In a multitenant cloud, we must assume that untrusted users can penetrate to VMs that have access to the internal network; consequently, we must rely on better methods to isolate intruders.

As 10-Gb networking becomes more affordable, we expect to see a trend to map these logical networks onto one or two physical 10-Gb NICs. Dropping to a single interface is good design [1], since a single 10-Gb port can carry more than twice the traffic of the four-NIC configuration. In addition, the single high-speed NIC design has more elastic bandwidth: one network can burst to consume up to 80 percent of the capacity and still leave the other networks with 1 Gb of bandwidth. Remember that the Admin network must access the motherboard Intelligent Platform Management Interface (IPMI) and management, and likely cannot ride on secondary interfaces.

Design Guidelines

Since there is no one-size-fits-all topology, we will outline some basic rules for constructing cloud networks, presented in priority order. Following these rules will help ensure you have a solid cloud connectivity foundation.

Rule 1: Cost matters.

Creating unused capacity wastes money. Idle backup links and under-subscribed bandwidth more than double costs. Adding complexity also costs money. Managing overlapping VLANs, complex routing rules, and sophisticated active-active paths injects the need for manual labor or expensive management tools.
Sidebar: About False Redundancy. While on the surface creating a one-to-one switch-to-server mapping may seem risky, it is actually more than three times more reliable than spreading the server's network load across four switches. Unless you have redundancy, adding components to a system will make it less reliable.

[1] While 10-Gb NICs do provide more bandwidth, they have limited packet counts that may prevent them from handling the full capacity. For example, VMs sending heavy loads with small packets may saturate a single 10-Gb NIC, where multiple 1-Gb NICs could handle the load.

In an inelastic enterprise data center, the investment in fully
redundant networks (usually requiring eight or more interfaces connecting to four interleaved switches) makes sense; however, cloud scale makes it economically unfeasible to buy, install, and manage the extra (and more expensive) equipment.

A hyperscale cloud network can use simpler switches because we have embraced fault zones. In our recommended configuration, each server connects to just one switch. This means you need fewer switches, fewer ports, less sophisticated paths, and even shorter wires. More components touching a node make that node less reliable. The only way to increase reliability is to add expensive and complex redundancy. Hyperscale clouds choose system-level redundancy plus more low-cost resources as a way to improve fault tolerance.

Rule 2: Keep your network flat.

Four thousand ninety-six sounds like a big number. That is the maximum number of VLANs that most networks will support without forcing you to get creative. You will need some VLANs to create logical networks and manage broadcast domains; however, using VLANs to segment tenant traffic will not scale. Our current density recommendation is 36 nodes per rack. If each node supports 32 VMs (4 per core), then each rack will sustain 1,152 VMs and require an allocation of nearly 2,500 IP addresses. Managing tiered networks and VLANs for systems at that density is not practical; consequently, cloud networks tend to be as flat as possible.

Our cloud network reference designs use stacking to create a logical top-of-rack switch: stacking uses short-distance 14-Gb networking that effectively merges all the switches. This allows for extremely fast and simple communication between nodes in the rack, and stacked switches can share 10-Gb uplinks to core routers per switch.
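Rule 2's density arithmetic can be checked in a few lines. The VM and node counts come from the text above; the per-VM and per-node address accounting is our own assumption, shown only to illustrate why per-tenant VLANs collapse at this scale.

```python
# Rack density arithmetic from Rule 2. The address accounting is an
# assumption used only to show the order of magnitude.

nodes_per_rack = 36
vms_per_node = 32                  # 4 VMs per core on an 8-core node

vms_per_rack = nodes_per_rack * vms_per_node
print(vms_per_rack)                # 1152 VMs in a single rack

# Assume one internal and one external address per VM, plus the
# nodes' own interfaces (four NICs each).
ips_per_rack = vms_per_rack * 2 + nodes_per_rack * 4
print(ips_per_rack)                # 2448 -> nearly 2,500 addresses
```

With roughly 2,500 addresses consumed per rack, a 4,096-VLAN budget is exhausted almost immediately if VLANs are spent per tenant, which is the quantitative case for keeping the network flat.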
This way, each switch can still be an isolated fault zone without paying the price of routing all traffic to core.

Rule 3: Filter at the edge.

Since VLANs do not scale, we need another way to prevent unwanted cross-tenant communication. The solution is to edge filter traffic at the node level. This requires the cloud management system (OpenStack Nova) to set up network access rules for each VM that it deploys. The rules must allow VMs from the same tenant to talk to each other while blocking other traffic. Currently, Linux iptables is the tool of choice for this filtering, but look for new approaches using OpenFlow or Open vSwitch.

Rule 4: Design fault zones.

You should be able to easily identify fault zones in your network topology. Remember that fault zones are used both to isolate the impact of failures and to simplify your design. Your lowest-paid data center tech and your highly automated cloud management system must both be able to understand the topology.

Sidebar: The 30-Second Rule of Complexity. If you've studied computer science, then you know there are algorithms that calculate "complexity." Unfortunately, these have little practical use for data center operators. Our complexity rule does not require a PhD: if it takes more than 30 seconds to pick out what would be impacted by a device failure, then your design is too complex.

Rule 5: Plan for local traffic.

Cloud applications are more likely to be chatty scale-out architectures than traditional tiered designs. While this delivers reliability by spreading work across fault zones, it creates a lot of internal network traffic. If this internal traffic has to route between switches over the core network, then you can oversubscribe your core bandwidth and impact external communications. Luckily, it is possible to predict internal communication because it is mainly between VMs for each tenant. This concern can be mitigated with additional outbound links,
stacking top-of-rack switches (see Rule 1 above), and clustering a tenant so most of its traffic aggregates into the same core switches.

Rule 6: Offer load balancers.

Our final rule helps enable good architecture hygiene by the cloud users. Making load balancers inexpensive and easy to use encourages customers to scale out their applications. Cloud providers need scaled-out applications to span fault zones and mitigate a hyperscale cloud's higher risk of edge failures. Several public clouds integrate load balancing as a core service or make pre-configured load balancer VMs easily available. If you are not encouraging customers to scale out their applications, then you should plan to scale out your help desk and Operations (Ops) team.

Operations Infrastructure

One of our critical lessons learned about cloud bootstrapping is that Ops capabilities are just as fundamental to success as hardware and software. You will need the same basic core Ops components whether you are planning a 1,000-node public cloud or a six-node lab. These services build upward in layers from core network services to monitoring and then to provisioning and access.

The Administration Server

Before we jump into specific services to deploy, it's important to allocate a small fraction (one for each 100 nodes) of your infrastructure as an Administration (Admin) server. In all of our deployment scripts, this server is the first one configured and provides the operations services that the rest of the infrastructure relies on. This server is not the one running your external APIs or portals; it is strictly for internal infrastructure management. During bootstrapping, it is the image server and deployment manager. Post-bootstrapping, it can be your bastion host and monitoring system. Even in our smallest systems, we make sure to dedicate a server for Admin because it makes operating the cloud substantially easier.
As we'll explore below, the Admin server is a real workhorse.

Core Services

Core services enable the most basic access and coordination of the infrastructure. These services are essential to cloud operations because the rest of the infrastructure anticipates a data center level of Ops capability. Unlike a single-node targeted SQL server, cloud software expects to operate in an Internet data center with all the services and network connectivity that come with being "on the net."

Sidebar: We have gone back and forth about using DHCP during normal operations. Initially, we were reluctant to introduce yet another dependency to set up and maintain. Ultimately, we embraced DHCP for the Admin network because it can be used both to deliver boot-up configuration and to sustain our PXE integration. Now that we have the infrastructure in place, we use DHCP and PXE to automate BIOS updates by booting through a patch image.

Here's the list of core services:

- DHCP: We've found that DHCP on the Admin network allows for central administration of node addresses and can be used to convey configuration information beyond
the IP address. We use static addressing on the other segments to avoid collisions with VM-focused network management services.
- DNS (Domain Names): Nodes must be able to resolve names for themselves, other nodes, the admin server, and clients. Using a cloud DNS server eliminates external dependencies. Ultimately, clouds generate a lot of DNS activity and need to be able to control names within their domain.
- NTP (Time Synchronization): Since the systems are generating certificates for communications, even small time drift can make it difficult to troubleshoot issues.
- Bastion Host (Network Access; recommended): To create (or resolve) network isolation, a bastion host can be configured to limit access to the admin network (production) or create access to the production networks (restricted lab).
- PXE (Network Install; recommended): Beyond lab installs, PXE is required because it's impractical to install bits on a large number of servers from media.
- SMTP (Outbound Email; recommended): Most cloud components will send email for alerts or account creation. Not being able to send email may cause hangs or errors, so it's advisable to plan for routing email.

Provisioning

The most obvious challenge for hyperscale is the degree of repetition required to bring systems online (aka provision them) and then maintain their patch levels. This is especially challenging for dynamic projects like OpenStack, where new features or patches may surface at any time. In the Dell cloud development labs, we plan for a weekly rebuild of the entire system.

To keep up with these installs, we invest in learning deployment tools like Puppet and Chef. Our cloud automation leverages a Chef server on the Admin server, and Chef clients are included on the node images. After the operating system has been laid down by PXE on a node, the Chef client retrieves the node's specific configuration from the server. The configuration scripts ("recipes" and "cookbooks" in Chef vernacular) not only install the correct packages, they also lay down the customized configuration and data files needed for that specific node. For example, a Swift data node must be given its correct ring configuration file.

To truly bootstrap a cloud, deployment automation must be understood as an interconnected system. We call this description a "meta configuration." Ops must make informed decisions about which drives belong to each Swift ring and which Nova nodes belong to each scheduler. To help simplify trial systems, our cloud installer makes recommendations based on your specific infrastructure. Ultimately, you must take the time to map the dependencies and fault zones of your infrastructure because each cloud is unique.

Monitoring

Once the nodes are provisioned, Ops must keep the system running. With hundreds of nodes and thousands of spindles, failures and performance collisions are normal occurrences. Of course, c