Introducing: Oh My Vagrant!

If you’re a reader of my code or of this blog, it’s no secret that I hack on a lot of puppet and vagrant. Recently I’ve fooled around with a bit of docker, too. I realized that the vagrant environments I built for puppet-gluster and puppet-ipa needed to be generalized, and they needed new features too. Therefore…

Introducing: Oh My Vagrant!

Oh My Vagrant is an attempt to provide an easy-to-use development environment so that you can be up and hacking quickly, and can focus on the real devops problems. The README explains my choice of project name.

Prerequisites:

I use a Fedora 20 laptop with vagrant-libvirt. Efforts are underway to create an RPM of vagrant-libvirt, but in the meantime you’ll have to read: Vagrant on Fedora with libvirt (reprise). This should work with other distributions too, but I don’t test them very often. Please step up and help test :)

The bits:

First clone the oh-my-vagrant repository and look inside:

git clone --recursive https://github.com/purpleidea/oh-my-vagrant
cd oh-my-vagrant/vagrant/

The included Vagrantfile is the current heart of this project. You’re welcome to use it as a template and edit it directly, or you can use the facilities it provides. I’d recommend starting with the latter, which I’ll walk you through now.

Getting started:

Start by running vagrant status (vs) and taking a look at the vagrant.yaml file that appears.

james@computer:/oh-my-vagrant/vagrant$ ls
Dockerfile  puppet/  Vagrantfile
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)

The Libvirt domain is not created. Run `vagrant up` to create it.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$

Here you’ll see the list of resultant machines that vagrant thinks is defined (currently just template1), and a bunch of different settings in YAML format. The values of these settings help define the vagrant environment that you’ll be hacking in.

Changing settings:

The settings exist so that your vagrant environment is dynamic and can be changed quickly. You can change the settings by editing the vagrant.yaml file. They will be used by vagrant when it runs. You can also change them at runtime with --vagrant-foo flags. Running a vagrant status will show you how vagrant currently sees the environment. Let’s change the number of machines that are defined. Note the location of the --vagrant-count flag and how it doesn’t work when positioned incorrectly.

james@computer:/oh-my-vagrant/vagrant$ vagrant status --vagrant-count=4
An invalid option was specified. The help for this command
is available below.

Usage: vagrant status [name]
    -h, --help                       Print this help
james@computer:/oh-my-vagrant/vagrant$ vagrant --vagrant-count=4 status
Current machine states:

template1                 not created (libvirt)
template2                 not created (libvirt)
template3                 not created (libvirt)
template4                 not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms: []
:namespace: template
:count: 4
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$

As you can see in the above example, changing the count variable to 4 causes vagrant to see a possible four machines in the vagrant environment. You can change as many of these parameters at a time as you like by using the --vagrant- flags, or you can edit the vagrant.yaml file. The latter is much easier and more expressive, in particular for expressing complex data types. The former is much more powerful when building one-liners, such as:

vagrant --vagrant-count=8 --vagrant-namespace=gluster up gluster{1..8}

which should bring up eight hosts in parallel, named gluster1 to gluster8.
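The precedence described above (built-in defaults, then vagrant.yaml, then --vagrant-foo flags) can be sketched in a few lines of Ruby. This is only an illustration under assumptions, not the code from the Oh My Vagrant Vagrantfile: the DEFAULTS hash and the split_args and merge_settings helpers are hypothetical names, and only three of the settings are modeled.

```ruby
require 'yaml'

# Hypothetical defaults, mirroring three keys from the vagrant.yaml shown above.
DEFAULTS = { :domain => 'example.com', :namespace => 'template', :count => 1 }

# Split an argument list into --vagrant-foo=bar overrides and everything else.
def split_args(argv)
  overrides = {}
  rest = []
  argv.each do |arg|
    if (m = arg.match(/\A--vagrant-([a-z]+)=(.*)\z/))
      overrides[m[1].to_sym] = m[2]
    else
      rest << arg
    end
  end
  [overrides, rest]
end

# Merge defaults, saved settings and command line overrides, in that order.
def merge_settings(saved, overrides)
  settings = DEFAULTS.merge(saved).merge(overrides)
  settings[:count] = Integer(settings[:count]) # flag values arrive as strings
  settings
end

overrides, rest = split_args(['--vagrant-count=4', 'status'])
settings = merge_settings({ :count => 1 }, overrides)
puts settings.to_yaml # what could be written back to vagrant.yaml
puts rest.inspect     # arguments left over for vagrant itself
```

In this sketch, anything that is not a --vagrant- flag is passed through untouched, and the merged result is what ends up persisted, which matches the count of 4 appearing in vagrant.yaml after the one-off flag was used.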

Other VMs:

Since one often wants to be more expressive in machine naming and heterogeneity of machine type, you can specify a list of machines to define in the vms array of the vagrant.yaml file. If you’d rather define these machines in the Vagrantfile itself, you can also set them up in the vms array defined there. It is empty by default, but it is easy to uncomment one of the many examples. These will be used as the defaults if nothing else overrides the selection in the vagrant.yaml file. I’ve uncommented a few to show you this functionality:

james@computer:/oh-my-vagrant/vagrant$ grep example[124] Vagrantfile 
    {:name => 'example1', :docker => true, :puppet => true, },    # example1
    {:name => 'example2', :docker => ['centos', 'fedora'], },    # example2
    {:name => 'example4', :image => 'centos-6', :puppet => true, },    # example4
james@computer:/oh-my-vagrant/vagrant$ rm vagrant.yaml # note that I remove the old settings
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example2                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example2
  :docker:
  - centos
  - fedora
- :name: example4
  :image: centos-6
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vim vagrant.yaml # edit vagrant.yaml file...
james@computer:/oh-my-vagrant/vagrant$ cat vagrant.yaml 
---
:domain: example.com
:network: 192.168.123.0/24
:image: centos-7.0
:sync: rsync
:puppet: false
:docker: false
:cachier: false
:vms:
- :name: example1
  :docker: true
  :puppet: true
- :name: example4
  :image: centos-7.0
  :puppet: true
:namespace: template
:count: 1
:username: ''
:password: ''
:poolid: []
:repos: []
james@computer:/oh-my-vagrant/vagrant$ vs
Current machine states:

template1                 not created (libvirt)
example1                  not created (libvirt)
example4                  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.
james@computer:/oh-my-vagrant/vagrant$

The above output might seem a little long, but if you try these steps out in your terminal, you should get the hang of it fairly quickly. If you poke around in the Vagrantfile, you should see the format of the vms array. Each element in the array should be a dictionary, where the keys correspond to the flags you wish to set. Look at the examples if you need help with the formatting.
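One way to picture how the vms array combines with the global settings is the sketch below. It is an assumption about the behaviour, reconstructed from the vagrant status output above, and not the Vagrantfile's actual code; the machines helper is a hypothetical name, and only the image, puppet and docker keys are treated as per-machine overrides.

```ruby
# Build the full machine list: the numbered namespace hosts first, then the
# entries from :vms. Global settings act as defaults and any key set on an
# individual machine overrides them.
def machines(settings)
  globals = {
    :image  => settings[:image],
    :puppet => settings[:puppet],
    :docker => settings[:docker],
  }
  numbered = (1..settings[:count]).map do |i|
    { :name => "#{settings[:namespace]}#{i}" }
  end
  (numbered + settings[:vms]).map { |vm| globals.merge(vm) }
end

settings = {
  :image => 'centos-7.0', :puppet => false, :docker => false,
  :namespace => 'template', :count => 1,
  :vms => [
    { :name => 'example1', :docker => true, :puppet => true },
    { :name => 'example4', :image => 'centos-6', :puppet => true },
  ],
}

machines(settings).each do |m|
  puts "#{m[:name]}: image=#{m[:image]} puppet=#{m[:puppet]} docker=#{m[:docker]}"
end
```

Because each element is just a hash, adding a new per-machine option in this model only means adding one more key to the globals hash.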

Other settings:

As you saw, other settings are available. There are a few notable ones that are worth mentioning. This will also help explain some of the other features that this Vagrantfile provides.

  • domain: This sets the domain part of each vm’s FQDN. The default is example.com, which should work for most environments, but you’re welcome to change this as you see fit.
  • network: This sets the network that is used for the vms. You should pick a network/cidr that doesn’t conflict with any other networks on your machine. This is particularly useful when you have multiple vagrant environments hosted off of the same laptop.
  • image: This is the default base image to use for each machine. It can be overridden per-machine in the vms list of dictionaries.
  • sync: This is the sync type used for vagrant. rsync is the default and works in all environments. If you’d prefer to fight with the nfs mounts, or try out 9p, both those options are available too.
  • puppet: This option enables or disables integration with puppet. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • docker: This option enables and lists the docker images to set up per vm. It is possible to override this per machine. This functionality will be expanded in a future version of Oh My Vagrant.
  • namespace: This sets the namespace that your Vagrantfile operates in. This value is used as a prefix for the numbered vms, as the libvirt network name, and as the primary puppet module to execute.

More on the docker option:

For now, if you specify a list of docker images, they will be automatically pulled into your vm environment. It is recommended that you pre-cache them in an existing base image to save bandwidth. Custom base vagrant images can easily be built with vagrant-builder, but this process is currently undocumented.

I’ll try to write up a post on this process if there are enough requests. To keep you busy in the meantime, I’ve published a CentOS 7 vagrant base image that includes docker images for CentOS and Fedora. It is being graciously hosted by the GlusterFS community.

What other magic does this all do?

There is a certain amount of magic glue that happens behind the scenes. Here’s a list of some of it:

  • Idempotent /etc/hosts based DNS
  • Easy docker base image installation
  • IP address calculations and assignment with ipaddr
  • Clever cleanup on ‘vagrant destroy’
  • Vagrant docker base image detection
  • Integration with Puppet

If you don’t understand what all of those mean, and you don’t want to go source diving, don’t worry about it! I will explain them in greater detail when it’s important, and hopefully for now everything “just works” and stays out of your way.
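For the curious, two of the items in that list (address calculation with ipaddr and idempotent /etc/hosts entries) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the logic used by Oh My Vagrant: the host_ip and ensure_hosts helpers, the offset of 100, and the exact hosts line format are all hypothetical. Only Ruby's standard IPAddr class is real.

```ruby
require 'ipaddr'
require 'socket'

# Return the address `offset` hosts into the network, as a string.
def host_ip(cidr, offset)
  net = IPAddr.new(cidr)
  addr = IPAddr.new(net.to_i + offset, Socket::AF_INET)
  raise ArgumentError, "#{addr} is outside #{cidr}" unless net.include?(addr)
  addr.to_s
end

# Add any missing entries to hosts-file content. Lines that are already
# present are left alone, so running it twice changes nothing.
def ensure_hosts(content, entries)
  lines = content.lines.map(&:chomp)
  entries.each { |e| lines << e unless lines.include?(e) }
  lines.join("\n") + "\n"
end

network = '192.168.123.0/24'
entries = (1..2).map do |i|
  name = "template#{i}"
  "#{host_ip(network, 100 + i)} #{name}.example.com #{name}"
end

hosts = ensure_hosts("127.0.0.1 localhost\n", entries)
puts hosts
```

The check against the network range is the reason to use IPAddr rather than string concatenation: an offset that walks off the end of the chosen CIDR fails loudly instead of silently producing an address in a neighbouring network.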

Future work:

There’s still a lot more that I have planned, and some parts of the Vagrantfile need clean up, but I figured I’d try and release this early so that you can get hacking right away. If it’s useful to you, please leave a comment and let me know.

Happy hacking,

James

 


Translations Between Domains: David Woods

One of the reasons I’ve continued to be more and more interested in Human Factors and Safety Science is that I found myself without many answers to the questions I have had in my career. Questions surrounding how organizations work, how people think and work with computers, how decisions get made under uncertainty, and how people cope with increasing amounts of complexity.

As a result, my journey took me deep into a world where I immediately saw connections — between concepts found in other high-tempo, high-consequence domains and my own world of software engineering and operations. One of the first connections was in Richard Cook’s How Complex Systems Fail, and it struck me so deeply I insisted that it get reprinted (with additions by Richard) into O’Reilly’s Web Operations book.

I simply cannot un-see these connections now, and the field of study keeps me going deeper. So deep that I felt I needed to get a degree. My goal with getting a degree in the topic is not just to satisfy my own curiosity, but also to explore these topics in sufficient depth to feel credible in thinking about them critically.

In software, the concept and sometimes inadvertent practice of “cargo cult engineering” is well known. I’m hoping to avoid that in my own translation(s) of what’s been found in human factors, safety science, and cognitive systems engineering, as they looked into domains like aviation, patient safety, or power plant operations. Instead, I’m looking to truly understand that work in order to know what to focus on in my own research as well as to understand how my domain is either similar (and in what ways?) or different (and in what ways?)

For example, just a hint of what sorts of questions I have been mulling over:

  • How does the concept of “normalization of deviance” manifest in web engineering? How does it relate to our concept of ‘technical debt’?
  • What organizational dynamics might be in play when it comes to learning from “successes” and “failures”?
  • What methods of inquiry can we use to better design interfaces that have functionality and safety and diagnosis support as their core? Or, are those goals in conflict? If so, how?
  • How can we design alerts to reduce noise and increase signal in a way that takes into account the context of the intended receiver of the alert? In other words, how can we teach alerts to know about us, instead of the other way around?
  • The Internet (including its technical, political, and cultural structures) has non-zero amounts of diversity, interdependence, connectedness, and adaptation, which by many measures constitutes a complex system.
  • How do successful organizations navigate trade-offs when it comes to decisions that may have unexpected consequences?

I’ve done my best to point my domain at some of these connections as I understand them, and the Velocity Conference has been one of the ways I’ve hoped to bring people “over the bridge” from Safety Science, Human Factors, and Cognitive Systems Engineering into software engineering and operations as it exists as a practice on Internet-connected resources. If you haven’t seen Dr. Richard Cook’s 2012 and 2013 keynotes, or Dr. Johan Bergstrom’s keynote, stop what you’re doing right now and watch them.

I’m willing to bet you’ll see connections immediately…



David Woods is one of the pioneers in these fields, and continues to be a huge influence on the way that I think about our domain and my own research (my thesis project relies heavily on some of his previous work), and I can’t be happier that he’s speaking at Velocity in New York, which is coming up soon. (Pssst: if you register for it here, you can use the code “JOHN20” for a 20% discount.)

I have posted before (and likely will again) about a paper Woods contributed to, Common Ground and Coordination in Joint Activity (Klein, Feltovich, Bradshaw, & Woods, 2005), which in my mind might as well be considered the best explanation of what “devops” means to me, and what makes successful teams work. If you haven’t read it, do it now.

 

Dynamic Fault Management and Anomaly Response

I thought about listing all of Woods’ work that I’ve seen connections in thus far, but then I realized that if I wasn’t careful, I’d be writing a literature review and not a blog post. :) Also, I have thesis work to do. So for now, I’d like to point only at two concepts that struck me as absolutely critical to the day-to-day of many readers of this blog, dynamic fault management and anomaly response.

Woods sheds some light on these topics in Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. Pay particular attention to the characteristics of these phenomena:

“In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity—what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

Anomaly response situations frequently involve time pressure, multiple interacting goals, high consequences of failure, and multiple interleaved tasks (Woods, 1988; 1994). Typical examples of fields of practice where dynamic fault management occurs include flight deck operations in commercial aviation (Abbott, 1990), control of space systems (Patterson et al., 1999; Mark, 2002), anesthetic management under surgery (Gaba et al., 1987), terrestrial process control (Roth, Woods & Pople, 1992), and response to natural disasters.” (Woods & Hollnagel, 2006, p.71)

Now look down at the distributed systems you’re designing and operating.

Look at the “runbooks” and postmortem notes that you have written in the hopes that they can help guide teams as they try to untangle the sometimes very confusing scenarios that outages can bring.

Does “safing” ring familiar to you?

Do you recognize managing “multiple interleaved tasks” under “time pressure” and “high consequences of failure”?

I think it’s safe to say that almost every Velocity Conference attendee would see connections here.

In How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands (Woods & Patterson, 1999), he introduces the concept of escalation, in terms of anomaly response:

The concept of escalation captures a dynamic relationship between the cascade of effects that follows from an event and the demands for cognitive and collaborative work that escalate in response (Woods, 1994). An event triggers the evolution of multiple interrelated dynamics.

  • There is a cascade of effects in the monitored process. A fault produces a time series of disturbances along lines of functional and physical coupling in the process (e.g., Abbott, 1990). These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications (Reiersen, Marshall, & Baker, 1988).
  • Demands for cognitive activity increase as the problem cascades. More knowledge potentially needs to be brought to bear. There is more to monitor. There is a changing set of data to integrate into a coherent assessment. Candidate hypotheses need to be generated and evaluated. Assessments may need to be revised as new data come in. Actions to protect the integrity and safety of systems need to be identified, carried out, and monitored for success. Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.
  • Demands for coordination increase as the problem cascades. As the cognitive activities escalate, the demand for coordination across people and across people and machines rises. Knowledge may reside in different people or different parts of the operational system. Specialized knowledge and expertise from other parties may need to be brought into the problem-solving process. Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process. The trouble in the underlying process requires informing and updating others – those whose scope of responsibility may be affected by the anomaly, those who may be able to support recovery, or those who may be affected by the consequences the anomaly could or does produce.
  • The cascade and escalation is a dynamic process. A variety of complicating factors can occur, which move situations beyond canonical, textbook forms. The concept of escalation captures this movement from canonical to nonroutine to exceptional. The tempo of operations increases following the recognition of a triggering event and is synchronized by temporal landmarks that represent irreversible decision points.

When I read…

“These disturbances produce a cascade of multiple changes in the data available about the state of the underlying process, for example, the avalanche of alarms following a fault in process control applications” 

I think of many large-scale outages and multi-day recovery activities, like this one that you all might remember (AWS EBS/RDS outage, 2011).

When I read…

“Existing plans need to be modified or new plans formulated to cope with the consequences of anomalies. Contingencies need to be considered in this process. All these multiple threads challenge control of attention and require practitioners to juggle more tasks.” 

I think of many outage response scenarios I have been in with multiple teams (network, storage, database, security, etc.) gathering data from places they

When I read…

“Multiple parties may have to coordinate to implement activities aimed at gaining information to aid diagnosis or to protect the monitored process.”

I think of these two particular outages, and how in the fog of ambiguous signals coming in during diagnosis of an issue, there is a “divide and conquer” effort distributed throughout differing domain expertise (database, network, various software layers, hardware, etc.) that aims to split the search space of diagnosis, while at the same time keeping each other up-to-date on what pathologies have been eliminated as possibilities, what new data can be used to form hypotheses about what’s going on, etc.

I will post more on the topic of anomaly response in detail (and more of Woods’ work) in another post.

In the meantime, I urge you to take a look at David Woods’ writings, and look for connections in your own work. Below is a talk David gave at IBM’s Almaden Research Center, called “Creating Safety By Engineering Resilience”:

David D. Woods, Creating Safety by Engineering Resilience from jspaw on Vimeo.

References

Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 139–184.

Woods, D. D. (1995). The alarm problem and directed attention in dynamic fault management. Ergonomics. doi:10.1080/00140139508925274

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. Boca Raton: CRC/Taylor & Francis.

Woods, D. D., & Patterson, E. S. (1999). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands. Stress, 1–13.

Woods, D. D., Patterson, E. S., & Roth, E. M. (2002). Can We Ever Escape from Data Overload? A Cognitive Systems Diagnosis. Cognition, Technology & Work, 4(1), 22–36. doi:10.1007/s101110200002

Teaching Engineering As A Social Science

Below is a piece written by Edward Wenk, Jr., which originally appeared in PRISM, the magazine for the American Society for Engineering Education (Publication Volume 6, No. 4, December 1996).

While I think that there’s much more than what Wenk points to as ‘social science’ – I agree wholeheartedly with his ideas. I might even say that he didn’t go far enough in his recommendations.

Enjoy. :)

 

Edward Wenk, Jr.

Teaching Engineering as a Social Science

Today’s public engages in a love affair with technology, yet it consistently ignores the engineering at technology’s core. This paradox is reinforced by the relatively few engineers in leadership positions. Corporations, which used to have many engineers on their boards of directors, today are composed mainly of M.B.A.s and lawyers. Few engineers hold public office or even run for office. Engineers seldom break into headlines except when serious accidents are attributed to faulty design.

While there are many theories on this lack of visibility, from inadequate public relations to inadequate public schools, we may have overlooked the real problem: Perhaps people aren’t looking at engineers because engineers aren’t looking at people.

If engineering is to be practiced as a profession, and not just a technical craft, engineers must learn to harmonize natural sciences with human values and social organization. To do this we must begin to look at engineering as a social science and to teach, practice, and present engineering in this context.

To many in the profession, looking at teaching engineering as a social science is anathema. But consider the multiple and profound connections of engineering to people.

Technology in Everyday Life

The work of engineers touches almost everyone every day through food production, housing, transportation, communications, military security, energy supply, water supply, waste disposal, environmental management, health care, even education and entertainment. Technology is more than hardware and silicon chips.

In propelling change and altering our belief systems and culture, technology has joined religion, tradition, and family in the scope of its influence. Its enhancements of human muscle and human mind are self-evident. But technology is also a social amplifier. It stretches the range, volume, and speed of communications. It inflates appetites for consumer goods and creature comforts. It tends to concentrate wealth and power, and to increase the disparity of rich and poor. In the competition for scarce resources, it breeds conflicts.

In social psychological terms, it alters our perceptions of space. Events anywhere on the globe now have immediate repercussions everywhere, with a portfolio of tragedies that ignite feelings of helplessness. Technology has also skewed our perception of time, nourishing a desire for speed and instant gratification and ignoring longer-term impacts.

Engineering and Government

All technologies generate unintended consequences. Many are dangerous enough to life, health, property, and environment that the public has demanded protection by the government.

Although legitimate debates erupt on the size of government, its cardinal role is demonstrated in an election year when every faction seeks control. No wonder vested interests lobby aggressively and make political campaign contributions.

Whatever that struggle, engineers have generally opted out. Engineers tend to believe that the best government is the least government, which is consistent with goals of economy and efficiency that steer many engineering decisions without regard for social issues and consequences.

Problems at the Undergraduate Level

By both inclination and preparation, many engineers approach the real world as though it were uninhabited. Undergraduates who choose an engineering career often see it as escape from blue-collar family legacies by obtaining the social prestige that comes with belonging to a profession. Others love machines. Few, however, are attracted to engineering because of an interest in people or a commitment to public service. On the contrary, most are uncomfortable with the ambiguities of human behavior, its absence of predictable cause and effect, its lack of control, and with the demands for direct encounters with the public.

Part of this discomfort originates in engineering departments, which are often isolated from arts, humanities, and social sciences classrooms by campus geography as well as by disparate bodies of scholarly knowledge and cultures. Although most engineering departments require students to take some nontechnical courses, students often select these on the basis of hearsay, academic ease, or course instruction, not in terms of preparation for life or for citizenship.

Faculty attitudes don’t help. Many faculty members enter teaching immediately after obtaining their doctorates, their intellect sharply honed by a research specialty. Then they continue in that groove because of standard academic reward systems for tenure and promotion. Many never enter a professional practice that entails the human equation.

We can’t expect instant changes in engineering education. A start, however, would be to recognize that engineering is more than manipulation of intricate signs and symbols. The social context is not someone else’s business. Adopting this mindset requires a change in attitudes. Consider these axioms:

  • Technology is not just hardware; it is a social process.
  • All technologies generate side effects that engineers should try to anticipate and to protect against.
  • The most strenuous challenge lies in synthesis of technical, social, economic, environmental, political, and legal processes.
  • For engineers to fulfill a noblesse oblige to society, the objectivity must not be defined by conditions of employment, as, for example, in dealing with tradeoffs by an employer of safety for cost.

In a complex, interdependent, and sometimes chaotic world, engineering practice must continue to excel in problem solving and creative synthesis. But today we should also emphasize social responsibility and commitment to social progress. With so many initiatives having potentially unintended consequences, engineers need to examine how to serve as counselors to the public in answering questions of “What if?” They would thus add sensitive, future-oriented guidance to the extraordinary power of technology to serve important social purposes.

In academic preparation, most engineering students miss exposure to the principles of social and economic justice and human rights, and to the importance of biological, emotional, and spiritual needs. They miss Shakespeare’s illumination of human nature – the lust for power and wealth and its corrosive effects on the psyche, and the role of character in shaping ethics that influence professional practice. And they miss models of moral vision to face future temptations.

Engineering’s social detachment is also marked by a lack of teaching about the safety margins that accommodate uncertainties in engineering theories, design assumptions, product use and abuse, and so on. These safety margins shape practice with social responsibility to minimize potential harm to people or property. Our students can learn important lessons from the history of safety margins, especially of failures, yet most use safety protocols without knowledge of that history and without an understanding of risk and its abatement. Can we expect a railroad systems designer obsessed with safety signals to understand that sleep deprivation is even more likely to cause accidents? No, not if the systems designer lacks knowledge of this relatively common problem.

Safety margins are a protection against some unintended consequences. Unless engineers appreciate human participation in technology and the role of human character in performance, they are unable to deal with demons that undermine the intended benefits.

Case Studies in Socio-Technology

Working for the legislative and executive branches of U.S. government since the 1950s, I have had a ringside seat from which to view many of the events and trends that come from the connections between engineering and people. Following are a few of those cases.

Submarine Design

The first nuclear submarine, USS Nautilus, was taken on its deep submergence trial February 28, 1955. The sub’s power plant had been successfully tested in a full-scale mock-up and in a shallow dive, but the hull had not been subject to the intense hydrostatic pressure at operating depth. The hull was unprecedented in diameter, in materials, and in special joints connecting cylinders of different diameter. Although it was designed with complex shell theory and confirmed by laboratory tests of scale models, proof of performance was still necessary at sea.

During the trial, the sub was taken stepwise to its operating depth while evaluating strains. I had been responsible for the design equations, for the model tests, and for supervising the test at sea, so it was gratifying to find the hull performed as predicted.

While the nuclear power plant and novel hull were significant engineering achievements, the most important development occurred much earlier on the floor of the U.S. Congress. That was where the concept of nuclear propulsion was sold to a Congressional committee by Admiral Hyman Rickover, an electrical engineer. Previously rejected by a conservative Navy, passage of the proposal took an electrical engineer who understood how Constitutional power was shared and how to exercise the right of petition. By this initiative, Rickover opened the door to civilian nuclear power that accounts for 20 percent of our electrical generation, perhaps 50 percent in France. If he had failed, and if the Nautilus pressure hull had failed, nuclear power would have been set back by a decade.

Space Telecommunications

Immediately after the 1957 Soviet surprise of Sputnik, engineers and scientists recognized that global orbits required all nations to reserve special radio channels for telecommunications with spacecraft. Implementation required the sanctity of a treaty, preparation of which demanded more than the talents of radio specialists; it engaged politicians, space lawyers, and foreign policy analysts. As science and technology advisor to Congress, I evaluated the treaty draft for technical validity and for consistency with U.S. foreign policy.

The treaty recognized that the airwaves were a common property resource, and that the virtuosity of communications engineering was limited without an administrative protocol to safeguard integrity of transmissions. This case demonstrated that all technological systems have three major components — hardware or communications equipment; software or operating instructions (in terms of frequency assignments); and peopleware, the organizations that write and implement the instructions.

National Policy for the Oceans

Another case concerned a national priority to explore the oceans and to identify U.S. rights and responsibilities in the exploitation and conservation of ocean resources. This issue, surfacing in 1966, was driven by new technological capabilities for fishing, offshore oil development, mining of mineral nodules on the ocean floor, and maritime shipment of oil in supertankers that if spilled could contaminate valuable inshore waters. Also at issue was the safety of those who sailed and fished.

This issue had a significant history. During the late 1950s, the U.S. Government was downsizing oceanographic research that initially had been sponsored during World War II. This was done without strong objection, partly because marine issues lacked coherent policy or high-level policy leadership and strong constituent advocacy.

Oceanographers, however, wanting to sustain levels of research funding, prompted a study by the National Academy of Sciences (NAS). Using the report’s findings, which documented the importance of oceanographic research, NAS lobbied Congress with great success, triggering a flurry of bills dramatized by such titles as “National Oceanographic Program.”

But what was overlooked was the ultimate purpose of such research to serve human needs and wants, to synchronize independent activities of major agencies, to encourage public/private partnerships, and to provide political leadership. During the 1960s, in the role of Congressional advisor, I proposed a broad “strategy and coordination machinery” centered in the Office of the President, the nation’s systems manager. The result was the Marine Resources and Engineering Development Act, passed by Congress and signed into law by President Johnson in 1966.

The shift in bill title reveals the transformation from ocean sciences to socially relevant technology, with engineering playing a key role. The legislation thus embraced the potential of marine resources and the steps for both development and protection. By emphasizing policy, ocean activities were elevated to a higher national priority.

Exxon Valdez

Just after midnight on March 24, 1989, the tanker Exxon Valdez, loaded with 50 million gallons of Alaska crude oil, fetched up on Bligh Reef in Prince William Sound and spilled its guts. For five hours, oil surged from the torn bottom at an incredible rate of 1,000 gallons per second. Attention quickly focused on the enormity of environmental damage and on blunders of the ship operators. The captain had a history of alcohol abuse, but was in his cabin at impact. There was much finger-pointing as people questioned how the accident could happen during a routine run on a clear night. Answers were sought by the National Transportation Safety Board and by a state of Alaska commission to which I was appointed. That blame game still continues in the courts.

The commission was instructed to clarify what happened, why, and how to keep it from happening again. But even the commission was not immune to the political blame game. While I wanted to look beyond the ship’s bridge and search for other, perhaps more systemic problems, the commission chair blocked me from raising those issues. Despite my repeated requests for time at the regularly scheduled sessions, I was not allowed to speak. The chair, a former official having tanker safety responsibilities in Alaska, had a different agenda and would only let the commission focus largely on cleanup rather than prevention. Fortunately, I did get to have my say by signing up as a witness and using that forum to express my views and concerns.

The Exxon Valdez proved to be an archetype of avoidable risk. Whatever the weakness in the engineered hardware, the accident was largely due to internal cultures of large corporations obsessed with the bottom line and determined to get their way, a U.S. Coast Guard vulnerable to political tampering and unable to realize its own ethic, a shipping system infected with a virus of tradition, and a cast of characters lulled into complacency that defeated efforts at prevention.

Lessons

These examples of technological delivery systems have unexpected commonalities. Space telecommunications and sea preservation and exploitation were well beyond the purview of just those engineers and scientists working on the projects; they involved national policy and required interaction between engineers, scientists, users, and policymakers. The Exxon Valdez disaster showed what happens when these groups do not work together. No matter how conscientious a ship designer is about safety, it is necessary to anticipate the weaknesses of fallibility and the darker side of self-centered, short-term ambition.

Recommendations

Many will argue that the engineering curriculum is so overloaded that the only source of socio-technical enrichment is a fifth year. Assuming that step is unrealistic, what can we do?

  • The hodge podge of nonengineering courses could be structured to provide an integrated foundation in liberal arts.
  • Teaching at the upper division could be problem- rather than discipline-oriented, with examples from practice that integrate nontechnical parameters.
  • Teaching could employ the case method often used in law, architecture, and business.
  • Students could be encouraged to learn about the world around them by reading good newspapers and nonengineering journals.
  • Engineering students could be encouraged to join such extracurricular activities as debating or political clubs that engage students from across the campus.

As we strengthen engineering’s potential to contribute to society, we can market this attribute to women and minority students who often seek socially minded careers and believe that engineering is exclusively a technical pursuit.

For practitioners of the future, something radically new needs to be offered in schools of engineering. Otherwise, engineers will continue to be left out.

Rough data density calculations

Seagate has just publicly announced 8TB HDD’s in a 3.5″ form factor. I decided to do some rough calculations to understand the density a bit better…

Note: I have decided to ignore the distinction between Terabytes (TB) and Tebibytes (TiB), since I always work in base 2, but I hate the -bi naming conventions. Seagate is most likely announcing an 8TB HDD, which is actually smaller than a true 8TiB drive. If you don’t know the difference it’s worth learning.

Rack Unit Density:

Supermicro sells a high density, double-sided 4U server, which can hold 90 x 3.5″ drives. This means you can easily store:

90 * 8TB = 720TB in 4U,

or:

720TB/4U = 180TB per U.

To store a petabyte of data, since:

1PB = 1024TB,

we need:

1024TB/180TB/U = 5.68 U.

Rounding up we realize that we can easily store one petabyte of raw data in 6U.

Since an average rack is usually 42U (tall racks can be 48U) that means we can store between seven and eight PB per rack:

42U/rack / 6U/PB = 7PB/rack

48U/rack / 6U/PB = 8PB/rack

If you can provide the power and cooling, you can quickly see that small data centers can easily get into exabyte scale if needed. One raw exabyte would only require:

1EB = 1024PB

1024PB/7PB/rack = 146 racks =~ 150 racks.
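The raw-density arithmetic above can be sketched as a few lines of bash integer math. This is only a rough sketch: the variable names and the ceil_div helper are illustrative assumptions, and it uses the same 90-drive 4U chassis, 8TB drives and 42U rack as the text.

```bash
#!/bin/bash
# Rough raw-density arithmetic, integer math only.
drives_per_server=90   # 3.5" drives in one 4U chassis (from the text)
tb_per_drive=8
server_u=4
rack_u=42              # an average rack
tb_per_pb=1024
pb_per_eb=1024

# ceil_div A B: integer division of A by B, rounded up (illustrative helper)
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

tb_per_server=$(( drives_per_server * tb_per_drive ))   # 90 * 8
tb_per_u=$(( tb_per_server / server_u ))                # 720 / 4
u_per_pb=$(ceil_div "$tb_per_pb" "$tb_per_u")           # 1024 / 180, rounded up
pb_per_rack=$(( rack_u / u_per_pb ))                    # 42 / 6
racks_per_eb=$(ceil_div "$pb_per_eb" "$pb_per_rack")    # 1024 / 7, rounded up

echo "${tb_per_server}TB per server, ${tb_per_u}TB per U"
echo "${u_per_pb}U per PB, ${pb_per_rack}PB per rack, ${racks_per_eb} racks per EB"
```

Rounding 1024/7 up instead of truncating gives 147 racks rather than 146, which is still comfortably "about 150".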

Raid and Redundancy:

Since you’ll most likely have lots of failures, I would recommend having some number of RAID sets per server, and perhaps a distributed file system like GlusterFS to replicate the data across different servers. Suppose you broke each 90 drive server into five separate RAID 6 bricks for GlusterFS:

90/5 = 18 drives per brick.

In RAID 6, you lose two drives to parity, so that means:

18 drives – 2 drives = 16 drives per brick of usable storage.

16 drives * 5 bricks * 8 TB = 640 TB after RAID 6 in 4U.

640TB/4U = 160TB/U

1024TB / 160TB/U = 6.4U per PB, and 42U/rack / 6.4U/PB = 6.56PB/rack =~ 7PB/rack.

Since I rounded a lot, the result is similar. With a replica count of 2 in a standard GlusterFS configuration, you average a total of about 3-4PB of usable storage per rack. Need a petabyte scale filesystem? One rack should do it!
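The RAID 6 and replication numbers can be checked the same way. This is a sketch under one extra assumption that is not in the text: it counts whole 4U chassis per rack (ten in a 42U rack, leaving 2U spare) instead of dividing by U, so the per-rack figure comes out slightly lower. All variable names are illustrative.

```bash
#!/bin/bash
# Usable capacity after RAID 6 and GlusterFS replication, integer math only.
drives_per_server=90
bricks_per_server=5    # five RAID 6 sets per chassis (from the text)
parity_drives=2        # RAID 6 gives up two drives per set to parity
tb_per_drive=8
server_u=4
rack_u=42
replica=2              # GlusterFS replica count (from the text)

drives_per_brick=$(( drives_per_server / bricks_per_server ))
data_drives=$(( drives_per_brick - parity_drives ))
usable_tb_per_server=$(( data_drives * bricks_per_server * tb_per_drive ))
usable_tb_per_u=$(( usable_tb_per_server / server_u ))

# Assumption: only whole chassis are racked, so 42U holds ten 4U servers.
servers_per_rack=$(( rack_u / server_u ))
rack_tb_after_raid=$(( servers_per_rack * usable_tb_per_server ))
rack_tb_after_replica=$(( rack_tb_after_raid / replica ))

echo "${drives_per_brick} drives per brick, ${data_drives} of them for data"
echo "${usable_tb_per_server}TB per server, ${usable_tb_per_u}TB per U after RAID 6"
echo "${rack_tb_after_raid}TB per rack after RAID 6, ${rack_tb_after_replica}TB after replica ${replica}"
```

The final figure of 3200TB is a little over 3PB, which lands inside the 3-4PB per rack estimate.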

Other considerations:

  • Remember that you need to take into account space for power, cooling and networking.
  • Keep in mind that SMR might be used to increase density even further (unless it’s already being used on these drives).
  • Remember that these calculations were done to understand the order of magnitude, and not to get a precise measurement on the size of a planned cluster.
  • Petabyte scale is starting to feel small…

Conclusions:

Storage is getting very inexpensive. After the above analysis, I feel safe in concluding that:

  1. Puppet-Gluster could easily automate a petabyte scale filesystem.
  2. I have an embarrassingly small amount of personal storage.

Hope this was fun,

Happy hacking,

James

 

Disclaimer: I have not tried the 8TB Seagate HDD’s, or the Supermicro 90 x 3.5″ servers, but if you are building a petabyte scale cluster with GlusterFS/Puppet-Gluster, I’d like to hear about it!

 


Hybrid management of FreeIPA types with Puppet

(Note: this hybrid management technique is being demonstrated in the puppet-ipa module for FreeIPA, but the idea could be used for other modules and scenarios too. See below for some use cases…)

The error message that puppet hackers are probably most familiar with is:

Error: Duplicate declaration: Thing[/foo/bar] is already declared in file /tmp/baz.pp:2; 
cannot redeclare at /tmp/baz.pp:4 on node computer.example.com

Typically this means that there is either a bug in your code, or someone has defined something more than once. As annoying as this might be, a compile error happens for a reason: puppet detected a problem, and it is giving you a chance to fix it, without first running code that could otherwise leave your machine in an undefined state.

The fundamental problem

The fundamental problem is that two or more contradictory declarative definitions might not be able to be properly resolved. For example, assume the following code:

package { 'awesome':
    ensure => present,
}

package { 'awesome':
    ensure => absent,
}

Since the above are contradictory, they can’t be reconciled, and a compiler error occurs. If they were identical, or if they would produce the same effect, then it wouldn’t be an issue. However, this is not directly allowed due to a flaw in the design of puppet core. (There is an ensure_resource workaround, to be used very cautiously!)

FreeIPA types

The puppet-ipa module exposes a bunch of different types that map to FreeIPA objects. The most common are users, hosts, and services. If you run a dedicated puppet shop, then puppet can be your interface to manage FreeIPA, and life will go on as usual. The caveat is that FreeIPA provides a stunning web-ui, and a powerful cli, and it would be a shame to ignore both of these.

The FreeIPA webui is gorgeous. It even gets better in the new 4.0 release.

Hybrid management

As the title divulges, my puppet-ipa module actually allows hybrid management of the FreeIPA types. This means that puppet can be used in conjunction with the web-ui and the cli to create/modify/delete FreeIPA types. This took a lot of extra thought and engineering to make possible, but I think it was worth the work. This feature is optional, but if you do want to use it, you’ll need to let puppet know of your intentions. Here’s how…

Type excludes

In order to tell puppet to leave certain types alone, the main ipa::server class has type_excludes. Here is an excerpt from that code:

# special
# NOTE: host_excludes is matched with bash regexp matching in: [[ =~ ]]
# if the string regexp passed contains quotes, string matching is done:
# $string='"hostname.example.com"' vs: $regexp='hostname.example.com' !
# obviously, each pattern in the array is tried, and any match will do.
# invalid expressions might cause breakage! use this at your own risk!!
# remember that you are matching against the fqdn's, which have dots...
# a value of true, will automatically add the * character to match all.
$host_excludes = [],       # never purge these host excludes...
$service_excludes = [],    # never purge these service excludes...
$user_excludes = [],       # never purge these user excludes...

Each of these excludes lets you specify a pattern (or an array of patterns) which will be matched against each defined type, and which, if matched, will ensure that your type is not removed if the puppet definition for it is undefined.

Currently these type_excludes support pattern matching in bash regexp syntax. If there is a strong demand for regexp matching in either python or ruby syntax, then I will add it. In addition, other types of exclusions could be added. If you’d like to exclude based on some type’s value, creation time, or some other property, these could be investigated. The important thing is to understand your use case, so that I know what is both useful and necessary.

Here is an example of some host_excludes:

class { '::ipa::server':
    host_excludes => [
        "'foo-42.example.com'",                  # exact string match
        '"foo-bar.example.com"',                 # exact string match
        "^[a-z0-9-]*\\-foo\\.example\\.com$",    # *-foo.example.com or:
        "^[[:alpha:]]{1}[[:alnum:]-]*\\-foo\\.example\\.com$",
        "^foo\\-[0-9]{1,}\\.example\\.com"       # foo-<\d>.example.com
    ],
}

This example and others are listed in the examples/ folder.
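To get a feel for how the quoting changes the match, here is a minimal bash sketch of the [[ =~ ]] behaviour that the comments above describe. It is not the module's actual code: the matches_regex and matches_literal helpers and the sample hostnames are made up for illustration, and the patterns are simplified versions of the ones in the example.

```bash
#!/bin/bash
# matches_regex FQDN PATTERN: an unquoted right-hand side is a POSIX extended regex
matches_regex() { [[ $1 =~ $2 ]]; }

# matches_literal FQDN STRING: a quoted right-hand side is matched as a literal string
matches_literal() { [[ $1 =~ "$2" ]]; }

# check LABEL COMMAND...: print whether COMMAND succeeded
check() {
    local label=$1; shift
    if "$@"; then echo "$label: match"; else echo "$label: no match"; fi
}

suffix='^[a-z0-9-]*-foo\.example\.com$'      # *-foo.example.com
numbered='^foo-[0-9]{1,}\.example\.com'      # foo-<digits>.example.com

check 'suffix, web-foo.example.com'  matches_regex 'web-foo.example.com' "$suffix"
check 'suffix, web-foo.example.org'  matches_regex 'web-foo.example.org' "$suffix"
check 'numbered, foo-42.example.com' matches_regex 'foo-42.example.com' "$numbered"

# An unescaped dot is a regex wildcard, so it also matches an X here...
check 'regex, hostnameXexample.com'   matches_regex   'hostnameXexample.com' 'hostname.example.com'
# ...while the quoted form only matches the dots themselves.
check 'literal, hostnameXexample.com' matches_literal 'hostnameXexample.com' 'hostname.example.com'
check 'literal, hostname.example.com' matches_literal 'hostname.example.com' 'hostname.example.com'
```

This is why the fqdn dots deserve attention: an unescaped dot in a regexp exclude can end up protecting more hosts than you intended.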

Type modification

Each type in puppet has a $modify parameter. The significance of this is quite simple: if this value is set to false, then puppet will not be able to modify the type. (It will be able to remove the type if it becomes undefined, which is what the type_excludes mentioned above is used for.)

This $modify parameter is particularly useful if you’d like to define your types with puppet, but allow them to be modified afterwards by either the web-ui or the cli. If you change a user’s phone number, and this parameter is false, then it will not be reverted by puppet. The usefulness of this field is that it allows you to define the type, so that if it is removed manually in the FreeIPA directory, then puppet will notice its absence, and re-create it with the defaults you originally defined.

Here is an example user definition that is using $modify:

ipa::server::user { 'arthur@EXAMPLE.COM':
    first => 'Arthur',
    last => 'Guyton',
    jobtitle => 'Physiologist',
    orgunit => 'Research',
    #modify => true, # optional, since true is the default
}

In true puppet style, the $modify parameter defaults to true. One thing to keep in mind: if you decide to update the puppet definition, then the type will get updated, which could potentially overwrite any manual change you made.

Type watching

Type watching is the strict form of type modification. As with type modification, each type has a $watch parameter. This also defaults to true. When this parameter is true, each puppet run will compare the parameters defined in puppet with what is set on the FreeIPA server. If they are different, then puppet will run a modify command so that harmony is restored. This is particularly useful for ensuring that the policy that you’ve defined for certain types in puppet definitions is respected.

Here’s an example:

ipa::server::host { 'nfs':    # NOTE: adding .${domain} is a good idea....
    domain => 'example.com',
    macaddress => "00:11:22:33:44:55",
    random => true,        # set a one time password randomly
    locality => 'Montreal, Canada',
    location => 'Room 641A',
    platform => 'Supermicro',
    osstring => 'RHEL 6.6 x86_64',
    comment => 'Simple NFSv4 Server',
    watch => true,    # read and understand the docs well
}

If someone were to change one of these parameters, puppet would revert it. This detection happens through an elaborate difference engine. This was mentioned briefly in an earlier article, and is probably worth looking at if you’re interested in python and function decorators.

Keep in mind that it logically follows that you must be able to $modify to be able to $watch. If you forget and make this mistake, puppet-ipa will report the error. You can, however, have different values of $modify and $watch per individual type.

Use cases

With this hybrid management feature, a bunch of new use cases are now possible! Here are a few ideas:

  • Manage users, hosts, and services that your infrastructure requires, with puppet, but manage non-critical types manually.
  • Manage FreeIPA servers with puppet, but let HR manage user entries with the web-ui.
  • Manage new additions with puppet, but exclude historical entries from management while gradually migrating this data into puppet/hiera as time permits.
  • Use the cli without fear that puppet will revert your work.
  • Use puppet to ensure that certain types are present, but manage their data manually.
  • Exclude your development subdomain or namespace from puppet management.
  • Assert policy over a select set of types, but manage everything else by web-ui and cli.

Testing with Vagrant

You might want to test this all out. It’s all pretty automatic if you’ve followed along with my earlier vagrant work and my puppet-gluster work. You don’t have to use vagrant, but it’s all integrated for you in case that saves you time! The short summary is:

$ git clone --recursive https://github.com/purpleidea/puppet-ipa
$ cd puppet-ipa/vagrant/
$ vs
$ # edit puppet-ipa.yaml (although it's not necessary)
$ # edit puppet/manifests/site.pp (optionally, to add any types)
$ vup ipa1 # it can take a while to download freeipa rpm's
$ vp ipa1 # let the keepalived vip settle
$ vp ipa1 # once settled, ipa-server-install should run
$ vfwd ipa1 80:80 443:443 # if you didn't port forward before...
# echo '127.0.0.1   ipa1.example.com ipa1' >> /etc/hosts
$ firefox https://ipa1.example.com/ # accept self-sign https cert

Conclusion

Sorry that I didn’t write this article sooner. This feature has been baked in for a while now, but I simply forgot to blog about it! Since puppet-ipa is getting quite mature, it might be time for me to create some more formal documentation. Until then,

Happy hacking,

James

 


3 Reasons Why Your Team Needs Rituals

It’s the same every morning: you get up and grab your morning coffee. No matter whether you brew it at home or fetch it on the road, your morning coffee is a ritual you never want to miss.

A ritual is a practice everyone knows how to do. It’s conducted regularly or on well defined occasions. Rituals help to create an identity for a group of people: nations, sports clubs or teams. How can rituals help form a high performing team?

Rituals Act as Social Glue

Rote repetition of team tasks creates a feeling of togetherness. Often you see teams invent their own, sometimes secret, rituals to more closely bind the group. Executing well known procedures together synchronizes people and strengthens the common ground on which to build trust. If you’re building a new team, or have issues with mistrust, introduce a few rituals like daily stand-ups or weekly planning meetings. Social rituals like team celebrations help as well.

Rituals Pace Your Work

They create a rhythm which structures your work week. You get used to having a daily stand-up every morning at 10 a.m. You know exactly that the next demo will be on Thursday afternoon – no surprises here. Adapting to changing schedules depletes energy, but rituals become second nature to everyone and create an environment where work can flow without friction. There’s no need to constantly watch out for last minute changes in schedules.

Rituals Help Keep Your Discipline

If you haven’t built rituals out of important meetings like retrospectives, it’s very likely you’ll start to skip them. There are always reasons not to do what you should, but if you make a ritual out of the important things, chances are better that you’ll do what’s needed. Following rituals is deeply embedded within humans. By using rituals, you tap into that deep human power, which forms the basis of our communities.

But Our Work Environment Changes Too Frequently to Build on Rituals

I hear that a lot. In some work environments you hardly know what you’ll be doing in a few hours. How can you plan for a regular “ritual”? In such cases, it helps to introduce them based on events instead of time. A simple example is celebrating the birthday of a colleague. Make sure every birthday follows a certain ritual: bring everyone together for five minutes, blow out a candle and give a gift.

Another ritual which helps structure your day is the daily stand-up. Come together every day at a fixed time. For a few minutes, talk about what you did yesterday, what you’re doing today and whether you’re having problems. This ritual helps synchronize the team and creates a shared understanding of what’s happening.

One of the Biggest Mistakes is Skipping a Ritual

If you’ve agreed on a certain ritual, stick to it no matter what. People come to expect that the ritual will happen, and they’ll be disappointed if their expectations aren’t met. That’s the power of rituals: people rely on them.

Rituals act as social glue, help pace your work, and foster discipline. Like your regular morning coffee, they give your team structure and stability in an ever changing environment.

Rundeck and Automating Operations at Salesforce (Videos)

A few interesting videos have been posted over on Rundeck.org talking about Salesforce’s internal automation project, codename Gigantor. I’ve embedded the videos below.

It’s a great example of using a toolchain philosophy to quickly build effective solutions at scale:

  • Rundeck is the workflow engine and system of record
  • SaltStack is the distributed execution engine (Salesforce’s Kim Ho wrote the SaltStack plugin for Rundeck)
  • Kingpin is a custom front-end that builds Salesforce-specific concepts and additional security constraints into the user experience

 

Kim Ho explains Gigantor’s architecture and gives a demo of the SaltStack Plugin for Rundeck:

 

Alan Caudill presents an overview of how Gigantor works and some of the design choices:

 

 

The post Rundeck and Automating Operations at Salesforce (Videos) appeared first on dev2ops.

One minute hacks: the nautilus scripts folder

Master SDN hacker Flavio sent me some tunes. They were sitting on my desktop in a folder:

$ ls ~/Desktop/
uncopyrighted_tunes_from_flavio/

I wanted to listen to them while hacking, but what was the easiest way…? I wanted to use the nautilus file browser to select which folder to play, and the totem music/video player to do the playing.

Drop a file named totem into:

~/.local/share/nautilus/scripts/

with the contents:

#!/bin/bash
# o hai from purpleidea
exec totem -- "$@"

and make it executable with:

$ chmod u+x ~/.local/share/nautilus/scripts/totem

Now right-click on that music folder in nautilus, and you should see a Scripts menu. In it there will be a totem menu item. Clicking on it should load up all the contents in totem and you’ll be rocking out in no time. You can also run scripts with a selection of various files.

Here’s a screenshot:

nautilus is pretty smart and even lets you know that this folder is special

I wrote this to demonstrate a cute nautilus hack. Hopefully you’ll use this idea to extend this feature for something even more useful.
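As one possible extension, here is a hedged sketch of a second script that saves the current selection as a simple playlist instead of playing it. It relies on the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS environment variable, which nautilus sets to the newline-delimited paths of the selected local files. The selection_to_playlist function, the PLAYLIST variable and the default output path are made up for this example.

```bash
#!/bin/bash
# Hypothetical nautilus script: write the selected paths to a playlist file.

# selection_to_playlist OUTFILE: write one selected path per line to OUTFILE
selection_to_playlist() {
    local out=$1 path
    : > "$out"                       # create or truncate the playlist
    while IFS= read -r path; do
        [ -n "$path" ] || continue   # skip the empty line after the last newline
        printf '%s\n' "$path" >> "$out"
    done <<< "${NAUTILUS_SCRIPT_SELECTED_FILE_PATHS:-}"
}

# PLAYLIST is an assumed override; the default location is arbitrary.
selection_to_playlist "${PLAYLIST:-${TMPDIR:-/tmp}/nautilus-selection.m3u}"
```

Drop it next to the totem script and make it executable in the same way, and it should show up in the same Scripts menu.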

Happy hacking,

James

 


Common Objections to DevOps from Enterprise Operations

I’ve been in many large enterprise companies helping them learn about devops, helping them understand how to improve their service delivery capability. These companies have heard about devops and are looking for help creating a strategy to adopt devops principles because they need better time to market and higher quality. Not everyone in the company believes in devops, and for different reasons. To some, devops sounds like a free-for-all where devs make production changes. To others, devops sounds like a bunch of nice-sounding high ideals, or it seems that devops can’t be adopted because the necessary automation tooling does not exist for their domain.

In the enterprise, the operations group is often centralized and supports many different application groups. When it comes to site availability, the buck stops with ops. If there is a performance problem, outage or issue, the ops team is the first line of defense, sometimes escalating issues back to the application team for bug fixes or for help diagnosing a problem.

Enterprises interested in devops are also usually practicing or adopting agile methodology, in which case demands on ops happen more often: during sprints (e.g., to set up a test environment) or after a sprint when ops needs to release software to the production site. The quickened pace puts a lot more pressure on the centralized ops team because they often get the work late in the project cycle (i.e., when it’s time to release to production). Because of time pressure or because they are overworked, operations teams have difficulty turning requested work around and begin to hear that developers want to do things for themselves. Those users might want to rebuild servers, get shell access, install software, run commands and scripts, provision VMs, modify network ACLs, update load balancers, etc. These users essentially want to do things for themselves and might feel like the centralized ops team needs to get out of their way.

How does the ops team, historically the one responsible for uptime in the production environment, permit or expand access to environments they support? How can they avoid being the bottleneck at the tail end of every application team’s project cycle? How does the business remove the friction but not invite chaos, outages and lack of compliance?

If you’re in this kind of enterprise environment, how do you start approaching devops? If you are a centralized operations team facing the pressure to adopt devops, here are some questions and concerns for the organization to ask or think about. The answers to these questions are important steps toward forming your devops strategy.

How does a centralized group handle the work that needs to be done to make applications run in production or across other environments?

Some enterprises begin by creating a specialized team called “devops” whose purpose is to solve “devops problems”. Generally, this means making things more operations friendly. This kind of team might also be the group that takes the hand off from application development teams, wraps their software in automation tooling, deploys it, and hands it off to the Site Reliability team. Unfortunately, a centralized devops team can become a silo and suffer from the same “late in the cycle” handoff challenges the traditional ops group sees. Also, there are always more developers and development projects than there can be devops engineers and devops team bandwidth. A centralized devops team can end up facing the same pressures as a traditional QA department does when they try “adding quality testing” as a separate process stage.

To make sure an application operates well in production and across other environments the devops concerns must be baked into the application architecture. This means the work to make applications easy to configure, deploy and monitor is done inside the development stage. The centralized operations group must then learn to develop a shared software delivery process and tool chain. It’s inside the delivery tool chain where the work gets distributed across teams. The centralized ops group can support the tool chain like architects and service providers providing the application development teams a framework and scaffolding to populate the needed artifacts to drive their pipeline.

What about our compliance policies?

Most enterprises abide by a change policy that dictates who can make production changes. Many times this policy is interpreted to mean anybody outside of ops is not allowed to push changes. Software must be handed off to an ops person to push the change. This handoff can introduce extra lead time and possibly errors due to lack of information.

These compliance rules are defined by the business, and many times people on the delivery end have never actually read the language of these policies and base process on assumptions or beliefs formed by tribal knowledge. Over time, tools and processes can morph in arcane ways, twisting into inefficient bureaucracy.

It’s common to find that different compliance rules apply depending on the application or customer type. When thinking about how to reduce delivery cycle time, these differences should be taken into account because there might be alternative ways of deciding who can make a change and how.

Besides understanding the compliance rules, it should also be simple and fast to audit your compliance.

This means make it easy to find out:

  • who made the change and were they authorized
  • where the change was applied
  • what change was made and is it acceptable

This kind of query should be instantly accessible and not something done through manual evidence gathering long after the fact (e.g., when something went wrong). Knowing how change was made to an environment should be as visible as seeing a report that shows how busy your servers were in the last 24 hours.

These audit views should contain infrastructure and artifact information because both development and operations people want to know about their environments in software and server terms. A change ticket with a bunch of verbiage and bug links does not paint a complete enough picture.

How do you open access but not lose controls?

After walking through a software delivery process, it’s easy to see that the flow of work slows anytime the work must be done by a single team that is already past its capacity and is losing effectiveness due to context switching between competing priorities. This is the situation an ops team often finds itself in. Ops teams balance work that comes from application development teams (e.g., participating in agile dev sprints), network operations (e.g., handling outages and production issues), business users (e.g., gathering info for compliance, asset info for finance) and finally, their own project work to maintain or improve infrastructure.

To free this process bottleneck the organization must figure out how the work can be redistributed or can be satisfied by some self service function. Since deployment, configuration and monitoring are ops concerns that should be designed into the application, distribute this development to the developers. This can really be a collaboration where ops maintains a base set of automation modules and gives developers ways to extend it. Create a development environment and tooling that lets developers integrate their changes into this ops framework in their own project sandboxes.
Provide developer access to create hosted environments easily through a self service interface that spins up the VMs or containers and lets them test the ops management code.

Build the compliance auditing logs into the ops management framework so you can track what resources are being created and used. This is important if resource conflicts occur, and it lets you learn where more sandboxing is needed or where more fine grained configuration should be defined.

Moving faster leads to less quality, right?

To the business, moving fast is critical to staying competitive by increasing their velocity of innovation. This need to quicken the software delivery pace is almost always the chief motivation to adopt devops practices.

Devops success stories often begin with how many times deployments are done a day. Ten deploys a day, 1000 deploys a day. To an enterprise these metrics can sound mythical. Some enterprises struggle to make one deploy a month, and I have seen some enterprises making major releases on an annual basis, with the rollout of the release to their customers taking over 30 days. That’s thirty days of lag time, and it puts the production environment in an inconsistent state, making it hard for everyone to cope with production issues. “Is it the new version or the old version causing this yet unidentified issue?” A primary reason operations is reluctant to move faster is the problems that occur during or after a change has been made.

When change leads to problems these are typical outcomes:

  • More control process is added (more approval gates, shorter change windows)
  • Change batches get bigger (cram more work into the given change window)
  • Increase in “emergency fixes” (high priority features get fast tracked to avoid the normal change process)
  • High pressure to make application changes quickly results in patching systems directly instead of going through the normal software release cycle.

Given these outcomes the idea of moving faster is crazy because obviously it will lead to breaking more stuff more often.

The question is how do organizations learn to be good at making change to their systems? Firstly, it is helpful to think about what kind of safety practices are important to move change. Moving fast means being able to safely change things fast. Here are some general strategies to consider:

Small batches

Large batches of change require more people on hand due to the volume of work and the work can take longer to get done.
The solution is to push less change through so it’s easier to get it done and have less to check and verify when the change is completed.

Rehearsal

Here’s a good mantra, “Don’t practice until you get it right. Practice until you can’t get it wrong.” Don’t make the production change be the first time you have tried it this way. Your change should have been verified multiple times in non production environments before you tried it in production. Don’t rely on luck. Expect failure.

Verifiable process stages

Whether it is a site build out or an update to an existing application, be sure you have well defined checks for your preconditions. This means if you are deploying an application you have a scripted test that confirms your external or environment dependencies before you do the deployment. If you are building a site, be sure you have confirmed the hardware and network environment before you install the operating platform. Building this kind of automated testing at process stage boundaries adds a great deal of safety by not letting problems slip downstream. You can use these verification checks to decide to “stop the line”.
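
A scripted precondition check can be quite small. The sketch below is one possible shape for it in Python. The function names (`run_preconditions`, `deploy`, `StopTheLine`) and the specific checks are assumptions invented for this example, and a real pipeline would have its own list of dependencies to confirm.

```python
import shutil
import socket


class StopTheLine(Exception):
    """Raised when a precondition fails, so the stage does not proceed."""


def port_reachable(host, port, timeout=2.0):
    """Check that an external dependency accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def enough_disk(path, min_free_bytes):
    """Check that the target filesystem has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes


def run_preconditions(checks):
    """Run every (name, check) pair and return the names that failed."""
    return [name for name, check in checks if not check()]


def deploy(checks, do_deploy):
    """Run the deployment step only if all preconditions pass."""
    failures = run_preconditions(checks)
    if failures:
        raise StopTheLine("preconditions failed: " + ", ".join(failures))
    return do_deploy()


# Example wiring. The database host below is hypothetical, so that check
# is left commented out; only the local disk check actually runs here.
checks = [
    ("free disk space", lambda: enough_disk(".", 1)),
    # ("database reachable", lambda: port_reachable("db.example.com", 5432)),
]
print(deploy(checks, lambda: "deployed"))
```

Running all checks before raising, rather than stopping at the first failure, means one run reports everything that needs fixing before the next attempt.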

Process discipline

What leads to places full of snow flake environments, each full of idiosyncratic, specially customized servers and networks? Lack of discipline. If the organization does not manage change consistently together, everyone ends up doing things their own way. How do you know you have process discipline? Look for how much variation you see. If process differs between environments, that is a variation. Snow flake servers are the symptoms of process variation. Process variation means you don’t have process under control. There are two simple metrics to understand how much control you have over your process: lead time and scrap rate. Lead time is how long it takes you to make the change. Scrap rate is how often the change must be reworked to make it right. Rehearsal and verifiable process stages will help you bring process under control by reducing scrap rate and stabilizing lead time. The biggest benefit of process discipline is improving your ability to deliver change predictably. The business depends on predictability. With predictability the business can gauge how fast or slow it can move.
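
Both metrics are easy to compute once changes are recorded with a start time, an end time and a rework flag. Here is a minimal Python sketch. The `Change` record and the sample numbers are assumptions made up for illustration. It uses the standard deviation of lead time as one simple measure of how stable the process is.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, pstdev


@dataclass
class Change:
    """One change request (hypothetical record for illustration)."""
    requested: datetime   # when the change was asked for
    completed: datetime   # when it was done in production
    reworked: bool        # did it have to be redone to make it right


def lead_times_hours(changes):
    """Lead time of each change, in hours."""
    return [(c.completed - c.requested).total_seconds() / 3600 for c in changes]


def scrap_rate(changes):
    """Fraction of changes that needed rework (0.0 for an empty list)."""
    if not changes:
        return 0.0
    return sum(c.reworked for c in changes) / len(changes)


# Made-up sample data: four changes, one of which was reworked.
changes = [
    Change(datetime(2014, 9, 1, 9), datetime(2014, 9, 2, 9), False),   # 24 h
    Change(datetime(2014, 9, 3, 9), datetime(2014, 9, 5, 9), True),    # 48 h
    Change(datetime(2014, 9, 8, 9), datetime(2014, 9, 9, 9), False),   # 24 h
    Change(datetime(2014, 9, 10, 9), datetime(2014, 9, 13, 9), False), # 72 h
]

hours = lead_times_hours(changes)
print(f"mean lead time: {mean(hours):.1f} h")
print(f"lead time spread (std dev): {pstdev(hours):.1f} h")
print(f"scrap rate: {scrap_rate(changes):.0%}")
```

A falling scrap rate and a shrinking spread in lead time are the signals that the process is coming under control, independent of whether the average lead time is short or long.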

More access into ops managed environments?

The better everyone understands how things perform in production, the better the organization can design its systems to support operations. Making it hard for developers or testers to see how the service is running only delays improvements that benefit the customer and reduce pressure on operations. It should be easy for anyone to know which versions of applications are deployed on which hosts, the host configuration and the performance of the application.

Sometimes data privacy rules make accessing data less straightforward. Some logs contain customer data and regulations might restrict access to only limited users. Instead of saying no or making the data collection and scrubbing process manual, make this data available as an automated self service so developers or auditors can get it for themselves.
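
An automated scrubbing step might look something like the following Python sketch. What counts as customer data depends entirely on your own regulations, so the two patterns here (email addresses and long digit runs) are illustrative assumptions only, not a complete or compliant redaction policy.

```python
import re

# Illustrative patterns only; a real policy would define its own.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{13,16}\b")  # e.g. card-like digit runs


def scrub(line):
    """Replace customer-identifying values with placeholders."""
    line = EMAIL.sub("<email>", line)
    return LONG_NUMBER.sub("<number>", line)


def scrub_log(lines):
    """Scrub every line of a log before handing it to a requester."""
    return [scrub(line) for line in lines]


sample = ["login ok user=jane@example.com card=4111111111111111"]
for line in scrub_log(sample):
    print(line)
```

Once a step like this runs automatically, the scrubbed logs can be published through the same self service channel as other production data instead of being gathered by hand on request.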

Visibility into the production environment is crucial for developers to make their environments production-like. Modeling the development and test environment so that it resembles production is another example of reducing variability and bringing process under control.

Does this mean shell access for devs?

This question is sometimes the worst one for a traditional enterprise ops team. Oftentimes the question is a symptom of another problem. Why does a dev want shell access to an environment operations is supporting? In a development or early test environment shell access might be needed to experiment with developing deployment and configuration code. This is a valid reason for shell access.

Is this request for shell access in a staging or production environment? Requests for shell access there could be a sign of ad hoc change methods that undermine the stability of an environment. It’s important that change methods are encapsulated in the automation.

Fundamentally, shell access to live operational environments is a question about risk and trust.


The list doesn’t stop here, but these are the most common questions and concerns I hear. Feel free to share your experiences in the comments below.

The post Common Objections to DevOps from Enterprise Operations appeared first on dev2ops.