The Relativity of Wrong

June 29, 2023

Full Stack Observability use cases

Business Use Cases

Full Stack Observability is all about collecting any possible data from the applications running your digital services (i.e. business KPI) and from the infrastructure and cloud resources supporting them (i.e. the telemetry), including potentially also IoT, robots or whatever device involved in the process.

And then correlating those data to create an actionable insight, so that you have full control of your business processes end-to-end and you do better than your competitors (faster, more reliable, more appealing processes and services).

The FSO value proposition is not only related to technology (the infrastructure that you can monitor and the metrics you can read). It is a business value proposition, because observability has an immediate impact on the business outcomes.

Associating business processes, and digital services supporting those, with the health state of the infrastructure gives the Operations teams an immediate and objective measure of the value - or the troubles - that IT provides to their internal clients, that are the lines of business (LOB). And LOB managers can enjoy dedicated dashboards that show how the business is doing, highlighting all the key performance indicators (KPI) that are relevant for each persona in the organization.

If there is any slowdown in the business, they see it instantly and can eventually relate it to a technical problem, or maybe to the release of a new version of a software application, or to the launch of a new marketing campaign. The outcome of any action and of any incident is connected to the business with... no latency. The same visibility is also useful when the business shows a better performance than the day before. You can relate outcomes to actions and events.

So, before speaking about the technology that supports the Full Stack Observability, let's discuss about the use cases and their impact.

We can group the use cases in three categories: Observe, Secure and Optimize (referred to your end-to-end business architecture).

In the Observe category, we have 4 fundamental use cases:

- Hybrid application monitoring

This refers to every application running on Virtual Machines, in any combination of your Data Center and Public Clouds, or on bare metal servers.

You can relate the business KPI (users served, processes completed, amount of money, etc.) to the health state of the software applications and the infrastructure. You can identify the root cause of any problem and relate it to the business transactions (= user navigation for a specific process) that are affected.

- Cloud native application monitoring

Same as the previous use case, but referred to applications designed based on cloud native patterns (e.g. microservices architecture) that run on Kubernetes or Openshift. Regardless it's on premises, in cloud, or in a hybrid scenario. Traditional APM solutions were not so strong on this use case, because they were designed for older architectures.

- Customer digital experience monitoring

Here the focus is on the experience from the end user perspective, that is affected by the performance of both the applications and the infrastructure, but also - and mostly - by the network. Network problems can eventually affect the response time and the reliability of the service because the end user needs to reach the end point where the application is run (generally a web server), the front end needs to communicate with the application components distributed everywhere, and these may be invoking remote API exposed by a business partner (e.g. a payment gateway or any B2B service).

- Application dependency monitoring

In this use case you want to assure the performance of managed and unmanaged (third-party) application services and APIs, including performance over Internet and cloud networks to reach those services. Visibility of network performance and availability, including both public networks and yours, is critical to resolve issues and to push service providers to respect the SLA of the contract.

In the Secure category, we can discuss the Business Risk Observability use case:

- Application security

Reduce business risk by actively identifying and blocking against vulnerabilities found in application runtimes in production. Associate vulnerabilities with the likelihood that they are exploited in your specific context, so that you can prioritize the suggested remediation actions based on the business impact (shown by the association of vulnerabilities with Business Transactions).

In the Optimize category, we have the following use cases:

- Hybrid cost optimization

Lower costs by only paying for what you need in public cloud and by safely increasing utilization of on—premises assets.

- Application resource optimization

Improve and assure application performance by taking the guesswork out of resource allocation for workloads on—premises and in the public cloud.

Observability and network intelligence coming together

The use cases listed above goes beyond the scope of traditional APM solutions (Application Performance Monitoring) because they require to extend the visibility to every segment of the network. This picture shows an example of possible issues that can affect the end user experience, and need to be isolated and remediated to make sure the user is happy.

That is generally difficult, and requires a number of subject matter experts in different domains, and a number of tools. Very few vendors can offer all the complementary solutions that give you visibility on all aspects of the problem. And, of course, they are not integrated (vertical, siloed monitoring).

Data-driven bi-directional integration

The Full Stack Observability solution from Cisco, instead, covers all the angles and - in addition - it does so in a integrated fashion. The APM tool (AppDynamics) and the Network Monitoring tool (ThousandEyes) are integrated bidirectionally through their API (out of the box, no custom integration is required).

The visibility provided by one tool is greatly enhanced by data coming from the other tool, that are correlated automatically and shown in the same console.

So, if you're investigating about a business transaction, you don't see just the performance of the software stack and its distributed topology, but also the latency, packet loss, jitter and more network metrics in the same context (exactly in the network segments that impact the traffic for that single business transaction, at that instant in time).

Similarly, if you're looking at a network, you immediately know what applications and business transaction would be affected if it fails or slows down. And automated tests can be generated to monitor the networks and the end points, that are created automatically from the topology of the application that the APM tool has discovered.

Exciting times are coming, the Operations teams can expect their life to be much easier when they start adopting a Full stack Observability approach. More detail in next posts...

June 14, 2023

Changing the focus of this blog: now... Observability

My previous post about Infrastructure as Code concludes the exploration of Data Center and Cloud solutions and the methodologies that are related to automation and optimization of IT processes.

I've been working in this area for 15 years, after spending the previous 15 in software development.
It's been an amazing adventure and I really enjoied learning new things, exploring and challenging some limits - and sharing the experience with you.

Now I start focusing on a new adventure... possibly for the next 15 years 😜

I assumed a new professional role, that is the technical lead for Full Stack Observability, EMEA sales, at Cisco Appdynamics. From now on, I will tell you stories about my experience with the ultimate evolution of monitoring: it's all about collecting telemetry from every single component of your business architecture, including digital services (= distributed software applications), computing infrastructure, network, cloud resources, IoT, etc.

It's not just putting all those data together, but correlating them to create an insight. Transforming raw data into information about the health state of your processes, matching business KPI with the state of the infrastructure that supports the services.

To visualize that information and to navigate it, you can subscribe to (or create your own) different domain models, that are views of the world built specifically for each stakeholder: from lines of business to applications managers, from SRE to network operations and security teams...

A domain model is made of entities and their relationships, where entities represent what is relevant in your (business or technical) domain. They might be the usual entities in a APM domain (applications, services, business transactions...) or infrastructure entities (servers, VM, clusters, K8s nodes, etc.).
You can also model entities in a business domain (e.g. trains, stations, passengers, tickets, etc.).

Unlike Application Performance Monitoring (APM), where solutions like Appdynamics and its competitors excel in drilling down in the application architecture and its topology, with Full Stack Observability you really have full control end-to-end and have a context shared among all the teams that collaborate at building, running and operating the business ecosystem.

New standards like OpenTelemetry make it easy to convey Metrics, Events, Logs and Traces (MELT) to a unique platform from every single part of the system, including eventually robots in manufacturing, GPS tracing from your supply chain, etc.

All these data will be filtered according to the domain model and those that are relevant will feed the databases containing information about the domain entities and their relationships, that are used to populate the dashboard.

Those data will be matched with any other source of information that is relevant in your business domain (CRM, sales, weather forecast, logistics...) so that you can analyse and forecast the health of the business and relate it to the technologies and the processes behind. You can immediately remediate any problem because you detect the root cause easily, and even be proactive in preventing problems before they occur (or before they are perceived by end users). At the same time, you are able to spot opportunities for optimising the preformances and the cost efficiency of the system.

To see what is the official messaging from Cisco about the Full Stack Observability, check this page describing the FSO Platform.

Stay tuned, interesting news are coming...

June 9, 2023

Infrastructure as Code: making it easy with Nexus as Code

In previous posts I've described the advantage provided by managing the infrastructure the same way developers manage the application code.

Infrastructure as Code means using the same toolset (version control systems, pipeline orchestrators, automated provisioning) and same processes for building, integrating, testing and releasing the system that are used in the release cycle of a software application. This approach has a positive impact on speed, reliability and security end to end.

Together with Ansible, Terraform is one of the most used tools in the automated provisioning space, and many organizations use it when they adopt Infrastructure as Code. The availability of plugins (Terraform Providers) for almost every possible target (physical and virtual servers, network and storage, cloud services, etc.) makes it a common platform for automation: a "de facto" standard.

As many other technology vendors, Cisco offers Terraform Providers wrapping the API of their products, especially for Data Center and Cloud technologies. The Nexus family of switches, that includes the ACI fabric architecture, makes no exception. You can provision and manage the ACI fabric easily with Terraform (as well as with Ansible), and many examples and reusable assets are available at DevNet.

Generally, Terraform Providers surface the object model of the target system so that resources and the their relationships can be managed easily in a configuration plan, representing the desired state of the system. You need to understand how that particular system works and, in some cases, to manage the relationships among managed objects identifiers explicitly.

This is an example of creating a tenant in ACI, and a VRF contained in it:

Some engineers find this object model, and the use of the HCL (Hashicorp Configuration Language), easy and comfortable. Others, maybe due to a limited experience, would prefer an easier syntax and simpler object model.

For this reason Cisco has created a module called Nexus as Code, that sits on top of the standard ACI provider for Terraform, hiding the perceived complexity and offering a simplified object model. The objects that are contained in each other are simply nested and represented in a way that's very close to the conceptual representation of the logical architecture (represented by the following picture)

Nexus as Code can be seen as a (optional) component in the Terraform solution to automate ACI and other network controllers from Cisco.

Using a configuration language as simple as YAML, nesting is represented with indentation in Nexus as Code. This example corresponds to the HCL snippet above:

This format is particularly suitable for copy/paste operations, that make it easy to clone and modify a template so it is ready for a new project.

If you start from the example above, simply copying one line you can have one more VRF created and contained in the same tenant. Definitely simpler that doing the same in a HCL file, and encouraging for a network engineer the first time he/she uses Terraform.

Everything you need is a folder to store one or more YAML file defining the desired state of the ACI fabric, and the installation of the Terraform binary file (free download from here). After that, you will just use the following two commands:

terraform init (that makes sure that the needed providers are installed, and eventually downloads them automatically)

terraform apply (that reads the input, evaluates changes required to align the state of the target fabric to the desired state, then call the API of the ACI controller)

when you confirm the apply, you will see the log of the execution and finally the message will tell that the job is done.

I believe that Nexus as Code is a powerful tool that may help engineers to approach the IaC (Infrastructure as Code) methodology easier, with no stress due to learning new complex technologies and tools.

Being based on standard, open-source tools, it does not introduce any lock in with Cisco technologies.
It simply translates easy-to-manipulate YAML files, that describe your desired state, into plain Terraform plans that are executed automatically.

So you can start adopting the same tools and same processes that developers use in building, integrating, testing and releasing the system, obtaining the same benefit in terms of speed, consistency, security and self-documentation.

Don't be shy, start today to experiment and see how easy it is 😜

July 4, 2022

Infrastructure as Code: tools and processes

In a previous post we have seen that Infrastructure as Code is a way of managing the infrastructure and the cloud resources, consisting in a set of processes and best practices.

But there is also a need for a set of tools, and this post will offer an overview of the most used tools in the industry. Most of them are free, open source tools, matched by a vendor-supported version that requires a license or a subscription. There are also SaaS versions that offer a free tier.

I'm describing the following tools in this post:

Version Control Systems
Automation tools (Ansible, Terraform)
Accessory tools for "scaffolding" (Vault, etc).

But, before we explore the tools, just a few more words about the process.

Programmable infrastructure

The simple fact that the infrastructure is programmable via the API it exposes, does not mean that anyone can come and change its configuration and/or state.

We don't want anarchy, and even less we want that programmers do whatever they like bypassing the owners of the target technology domain. The administrators, that are also the SME (subject matter experts) have the responsibility to ensure the reliability, performance and security of the system and cannot afford that a naive developer compromises it.

So what we mean with treating the infrastructure as code is applying the same processes, and same tools, as we do with the source code of the applications. The infrastructure provisioning and configuration should follow the same process that we implement for the applications: write the code, version the code, test it statically for quality and security, deploy it automatically, test it dynamically (functional, performance, reliability and security tests), then deploy it in the production environment. Generally, it happens within a CI/CD pipeline with a good level of automation (but the same sequence could also be executed manually).

Now that we have agreed on the basic principles, let's have a look at the tools.

Version Control Systems

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

The purpose of version control is to allow software teams track changes to the code, while enhancing communication and collaboration between team members. Version control facilitates a continuous, simple way to develop software.

Since we want to manage the infrastructure as developers do with the software applications, we use the same organization for the files describing the desired state of the infrastructure (remember, infrastructure includes physical, virtual and cloud resources).

A central repository is the single source of truth. Local working copies can be used to evolve the code, to create new versions and test them. After validation, a new version is committed in the VCS (version control system). The most used tools are github and gitlab, but there is a large choice available.

Pipeline orchestrators

When a new version is created, a number of activities must take place: they can be executed manually or, better, automated to increase speed and reduce vulnerability to human errors.

If any task or test fail, a notification is sent to the right stakeholder to solve the problem and the pipeline aborts. A new pipeline cycle will restart after the issue has been fixed. You might build a single pipeline for the infrastructure and for the application deployment, or more often separate them in distinct processes: depending on the work organization and the availability of resources, there is no strict need to rebuild the target environment every time a new application version is released.

Orchestrators for CI/CD pipelines can be open-source or commercial products or can be engines incorporated in the version control system.

The following picture shows an example of pipeline:

Automation tools (Ansible, Terraform)

Those are not the only tools available for automation, but they are by large the most used.

They access the target system (infrastructure, cloud and components in the software stack) remotely, with no need for a local agent.

Generally the target API are wrapped in plugins for the automation engine (called Ansible modules and Terraform providers) that are either built and supported by the vendor of the target technology, or by the open source community.

Ansible was born for managing servers, so its approach is more orientated towards configuration management. Terraform excels at provisioning resources, and brings you to concepts like immutable infrastructure (see below).

Both the tools are great and let you define the desired state of the system, making sure that the current state matches the desired state. If it does not, changes are executed automatically by destroying configuration items and recreating them as they need to be. Indeed, Ansible tolerates changing the configuration of existing resources, in that it is more procedural than declarative.

Immutable Infrastructure

An approach to managing services and software deployments wherein components are replaced rather than changed. They are effectively redeployed each time any change occurs.

Traditionally, an application or service update requires that a component is changed in production, while the complete service or application remains operational. Immutable infrastructure instead relies on instancing "golden images", where components are assembled on computing resources to form the service or application.

Once the service or application is instantiated, its components are set - thus, the service or application is immutable, unable to change. When a change is made to one or more components of a service or application, a new golden image is assembled, tested, validated and made available for use. Then the old instance is discontinued, to free the computing resources within the environment for other tasks.

You can find a very good description and reusable examples at this two websites:

Immutable Infrastructure with Ansible, Packer and Terraform on Azure - https://devopsand.beer/2022/03/26/immutable-Infrastructure-with-ansible-packer-and-terraform-on-azure/

Immutable Infrastructure Using Packer, Ansible, and Terraform - https://medium.com/paul-zhao-projects/immutable-infrastructure-using-packer-ansible-and-terraform-a275aa6e9ff7

Accessory tools for "scaffolding"

The automation you can build with these tools is amazing, and it saves you time and troubles (sometimes also money as a consequence). As a single individual, or part of a small team, you are much more productive thanks to the reuse of scripts and blueprints, to less troubleshooting required, to higher speed in provisioning.

When the size of the operations team, or of the organization made of different teams, grows beyond a handful of people, some coordination issues start being visible.

If many people use the same scripts (playbooks, configuration plans, etc.) those resources need to be accessible in a central repository (generally a VCS) and you need to enforce RBAC (role based access control) to protect them.
Credentials to access the target systems cannot be stored within the code in the VCS, so you need to store them separately and pass them in as variables.
If a change is pushed to the environment, people need to be notified (even more if a pipeline fails and someone has to fix it).
Bespoke IaC pipelines can stretch across personal machines or shared VMs creating a management nightmare
Terraform state files contains sensitive information which requires special handling and access control

So you start defining processes to work in a ordered manner, and adopting accessory tools to store the secrets (one example is Vault, to store credentials in a safe, centralized place). The governance work and the tools you start accumulating are defined scaffolding, and rapidly become such a burden that they exceed the advantages you've got from adopting the Infrastructure as Code approach (this happens only at a large scale and if you don't have experienced staff).

A solution for this problem is offered by the enterprise version of the tools (both Ansible and Terraform), that is also offered as a SaaS option. The paid versions - that are also supported by the vendors - offer everything you need for large teams' collaboration and avoid that you need to invest in creating all the operational framework.

I'm not saying that you absolutely need those versions, but consider that the miracles an engineer can do with the free, single binary file, local setup of the automation tools are less likely to be seen on a larger team scale when the IaC best practices are broadly adopted. There will be an inflection point where the benefits provided by the enterprise edition justify the cost of the solution.

April 28, 2022

Infrastructure as Code: what's the advantage

This post describes the value provided by managing the infrastructure the same way you manage the source code of software applications, applying standard tools and best practices to the automation. The reference to infrastructure, of course, includes all cloud services incorporated in your architecture.

The following topics areI explore in this post. More posts will follow with a deeper investigation, and to show what is the link between Infrastructure as Code (IaC) and DevOps.

What does Infrastructure as Code mean?
Is IaC a product I can buy?
Most common use cases.
From where do I start?
Resources to practice with Infrastructure as Code.

What does Infrastructure as Code mean

Infrastructure as code (IaC) is the process of managing and provisioning data center environments through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.

The IT infrastructure managed by this process includes both physical equipment, such as bare-metal servers, storage and network, as well as virtual machines, and associated configuration resources. The same concept applies to public cloud resources, i.e. IaaS and PaaS services.

The definition files for Infrastructure as Code are maintained in a version control system, similarly to what we do with source code of software applications. Generally, in these files you describe the desired state of the system, rather than a sequence of commands that must be executed. This implies that you trust a component in the infrastructure, called a controller, delegating all the logic and the exception handling to it (or to more than one).

Descriptive model, not commands.

You don't configure the individual components of the system (e.g. 20 switches, or 5 servers or 15 virtual machines and their virtual network) one by one, in the right order, managing eventual error conditions and verifying manually that everything work as expected.

You simply describe what you expect the system to look like to a software controller, that owns the configuration of all the individual components. The controller knows how to contact, provision and configure the elements and to make sure your intent is realised. If any command fails, everything is rolled back to ensure a clean state. The APIC controller in the Cisco ACI architecture has this role, but many examples can be found among Cisco products and other vendors', and open-source solutions.

It is like ordering a slice of cake

versus preparing the cake yourself following the recipe from your grandma:

In other architectures you don't have a centralised controller, but the programmability of the individual targets and the API that they expose allow for a remote, automated management that is still much better than using the command line interface or any GUI offered by the device. One script could update the configuration of dozens of devices at the same time, e.g. adding a VLAN to all the switches in the network.

Treat infrastructure like software (source control, single source of truth).

The input files for this process are text files, using different formats based on the tool you use. They might contain variables, whose values are defined externally to make the template reusable (e.g. via environment variables, databases, or systems designed to keep secrets like Vault). I use the word template for Ansible playbooks, Terraform plans, etc.

In any case these are text files, like the files that contain the source code of a software application. And they can be treated the same way: stored in a versioning system, edited collaboratively, subject to role-based access control, retrieved and built automatically by a pipeline orchestrator.

When you adopt this approach, the latest validated version of the system configuration is stored in the versioning system. You can consider that one the single source of truth, rather than the current configuration of the system (that might be corrupted by uncontrolled manual changes, either made intentionally or by mistake, or consciously applied long time ago for a reason that nobody remembers today). Instead, the last committed version in the repository is documented (including the tests that it passed) and ready to be applied again to reset the system, in case you need to solve a configuration drift, or to clone the environment, or for other use cases that require consistency.

Provision and configure entire environments

One example is creating clones of a complete environment, including computing, network and storage resources, to deploy an application in the different phases of the release process. Even with different sizing, being generated by a single template (or blueprint) makes sure they are identical in the configuration that influences the behaviour of the applications deployed.
There will be no surprise due to a missing configuration of a firewall port, of a datastore or a vlan trunk: consistency is granted, troubleshooting is limited.

Ensure idempotence

Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application (you can find the complete definition here). Well designed API ensure that repeated calls with the same input will not alter the state of the system, that is good because in case of retry you don't risk to create duplicated resources or other troubles.

Is Infrastructure as Code a product I can buy?

No: IaC is a methodology, not a product. It's a set of best practices that you can gradually adopt, and learn step by step. You can start with basic use cases, like creating a new tenant with a few associated resources, if that is a recurring activity and you want to make it faster and error-proof.

Then you will grow with more complex use cases, like creating an entire test environment on demand with all needed resources, including services from a cloud platform. The adoption of Infrastructure as Code is not a big bang, you don't need to build complete automation with no manual activity in one week. You can target quick wins, that validate the approach and generate momentum in the organization. The critical factor in the adoption is change management (i.e. the introduction of a new operational model and new ways of doing things), not the technology.

Of course you need supporting tools: automation frameworks like Ansible and/or Terraform (eventually also scripting languages could do the job), versioning systems, collaboration systems. And your infrastructure needs to be programmable (public cloud is programmable by default), meaning that your servers, networks, storage should be managed via software controllers or, at least, expose well-documented API.

Common use cases for Infrastructure as Code

Environments on demand

IT admins and the Operations teams receive a lot of requests from the applications teams: they need a configuration change to fix a problem or to deploy a new service, they need a new test environment, they need a clone for a new tenant, etc. Most of these requests cannot be satisfied immediately because of higher priorities, or because they require the collaboration of different teams that needs planning.

If the provisioning of the system - and some "day 2 operations" - was automated, with controlled execution of validated templates, both parties (Dev and Ops) would save time and be more satisfied... and efficient, for the benefit of the entire company.

Shared resource pool to increase efficiency

Some companies keep separate environments for each stage of a project: integration test, performance test, quality assurance, production, etc. Resources are always allocated, regardless they are only used - let's say - a week every three months (the interval may vary between one day and one year): only when they release a new version of the project, or have a maintenance windows for deploying bug fixes.

Keeping resources allocated when they are not in use is a waste of capacity, hence a waste of money. Imagine if you multiply the waste by the number of projects. But they cannot do it differently because of the complexity and the time required to build the different environments.

If they could - and with automation they can - recreate an identical environment, end to end, whenever required, they could dispose each environment as soon as it's no longer in use. Knowing that they can recreate it in minutes, they would reuse the returned resources for another stage or another project.
Using a shared resource pool (computing, networking and storage) for many project would be more efficient from a cost perspective. It applies to fixed capacity (less capex: you need to buy less hardware to satisfy all the requests) but also to pay per use scenarios (less opex: you dismiss resources when not needed).

Disaster Recovery

In case you need to repurpose existing or new resources to recover from a disaster, recreating a clone of the system from a single source of truth is much faster and safer. Generating the new infrastructure from the same blueprint that had created the old one, makes sure they are identical.

Blueprints and Compliance

Subject matter experts from every technology domain that collaborate to provision and maintain a system, instead of being engaged every time, could design and release Infrastructure as Code models once. Users (i.e. applications teams or other operations teams) could then use the blueprints for a self-service provisioning, without depending on the availability of the SME. The SME would save their time, feeling safe because the blueprints respect all the defined constraints and comply with the policies (no provisioning anarchy is allowed).

Auditing

Running automation scripts with standardised logging, or better using a pipeline orchestrator for provisioning and configuring systems, would trace what operations have been done, by whom, the input and the outcome. Very useful audit information, with no effort.

From where do I start?

Tools: Ansible and Terraform

Those are the most widely used tools for automation (with or without an Infrastructure as Code approach). They are open-source and free, easy to use. An enterprise version also exist, and in some cases you will find it very useful. But you can start practicing with the free tool and use it for years, with great advantage, if you are the only responsible for the infrastructure. In case of teamwork, you can still use the free version and dedicate some time to build your own operational model and additional tools, or you can switch to the enterprise version that makes it easy to scale at the enterprise level.

You can download the software from the Ansible and Terraform websites, along with good documentation and reusable examples (see below). Good tutorials are also available.

Single operating tool: Cisco Intersight

Cisco Intersight™ is a Software-as-a-Service (SaaS) hybrid cloud operations platform which delivers intelligent automation, observability, and optimization to customers for traditional and cloud-native applications and infrastructure. It supports Cisco Unified Computing System™ (Cisco UCS®) and Cisco HyperFlex™ hyperconverged infrastructure, other Intersight-connected devices, third-party Intersight-connected devices, cloud platforms and services, and other integration endpoints. Because it’s a SaaS-delivered platform, Intersight functionality increases and expands with weekly releases.

With Intersight, you get all of the benefits of SaaS delivery and full lifecycle management of distributed infrastructure and workloads across data centers, remote sites, branch offices, and edge environments. This empowers you to analyze, update, fix, and automate your environment in ways that were not previously possible. As a result, your organization can achieve significant TCO savings and deliver applications faster in support of new business initiatives.

Resources to practice Infrastructure as Code

DevNet - Cisco's developers community, that offers tutorials, sandboxes, labs and reusable assets. This is the Infrastructure as Code page at DevNet: https://developer.cisco.com/iac/

Terraform - documentation, download and tutorials at https://www.terraform.io/. The integration with Cisco Intersight is explained at https://www.hashicorp.com/resources/standardizing-hybrid-cloud-environments-with-hashicorpterraform-and-cisco-intersi

Ansible - documentation, download and tutorials can be found at https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html

Cisco workshops on Infrastructure as Code - feel free to contact me if you're interested in participating in our free, 3 x half days hands-on workshop.

May 25, 2020

Cost control in a multicloud world - part 2

This post is the second part of a discussion about optimization of the cloud budget and cost control in your IT organization. The first part is here.

The CloudCenter Suite

The CloudCenter Suite is a key component in this framework: it is made of three modules that offer provisioning, lifecycle automation and – what we are discussing here – cost control.

The Cost Optimizer module collects data from the API of all the cloud providers to create a series of detailed reports, that you can use to understand where and why you spend your money. You can slice and dice the information across different dimensions, and you can schedule reports or download them to feed a billing system.

complete and granular reporting of cloud spend

You can create any custom organizational hierarchy to map your business (departments, customers, projects, etc.) and give them visibility and responsibility of the budget consumption.

custom hierarchy of cost groups

budget management

The Cost Optimizer gives you recommendations for rightsizing, wherever your assets are deployed, analysing their behavior. You can evaluate the suggestion and then act manually, or you can enable the tool to do it for you automatically.

recommendations for rightsizing and other optimizations

It will also tell you when it makes sense to adopt reserved instances to get a discount, showing you how much you’re going to save.

Act now to reduce your spending… across all clouds

The best practices and the tools are there. You can choose among tools that are specific to a cloud provider (they all offer good solutions) or a cloud agnostic solution that works with all clouds.

With the Cisco CloudCenter Suite you get a fully detailed report of your inventory, all the services you are consuming everywhere, and the efficiency from a budget standpoint.

You also get actionable suggestions to save money that can be, if you configure the tool to do so, executed automatically. If you prefer, you can just get the list of suggested actions and implement them manually.

Thanks to the Cisco CloudCenter Suite you can setup a common governance model and a set of policies in one place, instead of replicating the build of reports and automation in every single cloud, based on full visibility that goes beyond the suggestions given by your service providers.

All you have to do now is to download and test the CloudCenter Suite (30 days trial), or contact us for a live demo or a discussion of your use cases.

Cost control in a multicloud world

Using public cloud is not as cheap as you expected?

Is allocating budget and verifying spending a little more difficult than they had told you?

You’ll probably find some best practices and tools for cost control in public cloud services, described in this series of two posts, useful to improve the efficiency of your consumption. They apply to every IaaS and PaaS service in a multicloud context.

What CxO expect from IT

Surveys show that the enterprise IT is getting a predominant role in front of the lines of business when they need cloud services. Not just to select the best solution for them, but to help keeping the financial aspects in control (source: RightScale State of the Cloud Report).

role of enterprise IT in cloud decisions

Looking at the initiatives, that is where budget is allocated, cost control and cost saving is on top:

Allocation of budget for cloud initiatives

Like drug dealers

Cloud providers are making a lot of money. And, of course, they want more customers and they want to retain the existing ones. Competitors steal some, and others just fly away when they realize that operating applications in production is more expensive than they had thought.

So, they offer you some candies, just to taste what a beautiful trip they can offer you. Even with a free account, you get amazing services that work very well. See also Azure and Google. They all make building business applications quick and easy.

appealing services offered by AWS

Developers are attracted by that, and they love creating cloud native applications using PaaS services. But they do not realize, or they just don’t care of, how expensive it will be to run applications in production.

In addition, they generate a lock-in because PaaS services are not portable across clouds.

Is public cloud easier or is it cheaper?

Now ask yourself this question: do you use the public cloud because it’s easier or because it’s cheaper? Or maybe you’re just lazy and you want to delegate the SLA responsibility to the service provider?

I will not discuss if it’s easier in this post. But, for sure, it’s not cheaper. Budgets go out of control very often.

Those are some tips from Amazon (source: AWS re:Invent2019)

tips for saving from AWS

Of course, cloud providers want to help you to be efficient but… not too much! The two green actions look like the most effective, now let’s see if we can do anything better.

Cost control framework

We could create a cost control framework, as the large company I took this information from.

cost saving framework

We focus mostly on two aspects: avoidance of unnecessary costs and iteration of the improvement actions.

In the avoidance best practice, we concentrate on the efficiency of VM snapshots retention: how many and how long. Storage volumes and public ip addresses should be monitored, because they generate a cost. When you use a S3 storage you can optimize the cost based on frequency of access, speed and size. And VM should not be larger than need, because you pay also the capacity in excess. Suspending VM at night, in the weekend or whenever the business application is not used, also saves money. And so does scaling capacity dynamically

You should monitor the situation carefully, and automation helps here. You can setup alerts and governance policies, having reports sent to the stakeholders. Automation in provisioning, but mostly in reporting and resource adjustment is key.

You can obtain important savings

This is what cost optimization saved in the context of a large SaaS service operations. Most of the saving comes from adoption of reserved instances where appropriate and from cleaning environments used for proof of concepts. The total saving is $7M per year.

savings from cloud optimization

And this is what they are planning: choosing the right instance types, rightsizing, cleaning up will save $7M more per year.

expected savings from recurring optimization

Cost control: a solution

We will examine a simple 3 steps process to save your budget.

Understand your assets and ecosystem
Optimize based on best practices, on investigation through your assets or... recommendations from a tool
Avoid unnecessary costs and iterate of the improvement actions

Easy principles to set a governance model are:

Define (and enforce) budgets for your customers, projects and departments.
Monitor and report about that.
Set policies to automate suspension, autoscaling and cleanup.

Cisco offers a solution that helps you to achieve all that, quick and easy.

We have built a framework that is based on the feedback loop suggested by best practices (e.g. the second principle of DevOps recommends looking at the entire system as a whole and iterating the optimization).

the optimization loop

In next post you will find a solution for implementing your cost control quickly and consistently across all the clouds you use: any combination of private cloud technologies and public clouds, with no need to adopt specific solutions for each individual target.

Keep reading, it will prove useful for your strategy.

Pages