Track producer: 洪小军 (Hong Xiaojun)

Meitu, Vice President of Technology, Meitu Cloud

Track: Chaos Engineering

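The experimental theory of chaos engineering was put forward in 2017 by Netflix and related teams. Its goal is to verify how well a system defends against unexpected failures by periodically injecting faults into the production environment. Compared with reacting to failures passively, running chaos engineering experiments with a controlled impact exposes system weaknesses early and strengthens our confidence in the system's ability to recover. Chaos engineering is an emerging field with little accumulated industry awareness or practice, and most IT teams do not yet treat it as a discipline in its own right. This track examines the field from several angles.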

Talks in this track

Chaos Engineering – past, present and future
Vilas Veeraraghavan Walmart Labs Director Of Engineering
Track: Chaos Engineering

Session overview

Many companies have actively reduced their dependence on managed data centers and migrated to cloud-native solutions for all of their software needs. The rush to build massive ecosystems of microservices in the cloud has created extremely complex cross-connections that are not necessarily well designed. This leads to customer-facing outages and problems caused by small glitches in the system, which require hours of debugging and almost always result in lost revenue. To prevent this, we propose a solution: create controlled chaos in the system and learn where your weaknesses are before your customers do. This field of engineering is called chaos engineering (or resilience engineering). It is a rapidly growing discipline that began at Netflix and has now spawned an entire industry of its own.

In this talk, I will present the history of chaos engineering and the primary motivations that propelled the movement. I will also touch on the innovations, the state of the industry, and the popular products that companies use to adopt this discipline today. I will draw on my experiences at Netflix and Walmart Labs to present a picture of a future in which chaos engineering becomes a staple of any cloud delivery platform.

Audience takeaways

Learn about chaos engineering: what motivated it, where we are today, and where we are going. Learn how to implement it in your own company and reap the benefits of greater resilience.

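Chaos Engineering Practice for Distributed Services
肖长军 (Xiao Changjun) Alibaba Senior Development Engineer, High Availability Architecture Department
Track: Chaos Engineering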

Session overview

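Background:
In a microservice environment, dependencies between systems grow ever more complex, and quite possibly no one can say exactly how a single failure will affect the system as a whole. Traditional testing mostly verifies each service's functionality and performance bottlenecks, yet the failure of a single microservice can make the entire service unavailable. The best way to reduce failures is to make problems happen often. Adopting chaos engineering therefore means repeating the failure process within a controlled scope or environment to keep improving the fault tolerance and resilience of the distributed system.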

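Approach / keys to success:
1. Build a chaos experiment platform quickly and effectively
2. Map out the services on the core path
3. Define each service's steady state, automatic fault-tolerance plan, and expected business impact
4. Fix the problems found and keep running drills
5. Organize surprise drills so that teams build skill through real practice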
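
As a rough illustration of steps 3 and 4 above, the Python sketch below runs a single experiment: it checks the steady state, injects a fault, checks the steady state again, and always rolls the fault back. Every name here (`ChaosExperiment`, `run_experiment`, the toy `service` dictionary) is hypothetical and is not part of chaosblade or any other tool; a real platform would inject faults into real services and read the steady state from monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """One experiment: a fault plus the steady state it must not break."""
    name: str
    steady_state: Callable[[], bool]  # True while the service looks healthy
    inject: Callable[[], None]        # introduce the fault
    rollback: Callable[[], None]      # remove the fault


def run_experiment(exp: ChaosExperiment) -> str:
    """Return 'skipped', 'passed', or 'failed' for a single run."""
    if not exp.steady_state():
        # Never inject a fault into a system that is already unhealthy.
        return "skipped"
    exp.inject()
    try:
        return "passed" if exp.steady_state() else "failed"
    finally:
        # Always remove the fault, whatever the outcome.
        exp.rollback()


# Toy stand-in for a core-path service with a fallback (hypothetical).
service = {"dependency_up": True, "fallback_enabled": True}


def core_path_healthy() -> bool:
    # Healthy if the dependency answers or the fallback can take over.
    return service["dependency_up"] or service["fallback_enabled"]


def kill_dependency() -> None:
    service["dependency_up"] = False


def restore_dependency() -> None:
    service["dependency_up"] = True


experiment = ChaosExperiment(
    name="dependency-outage",
    steady_state=core_path_healthy,
    inject=kill_dependency,
    rollback=restore_dependency,
)

result = run_experiment(experiment)
print(result)
```

A "failed" result points at a weakness to fix (step 4); rerunning the same experiment after the fix is what turns a one-off test into a continuing drill.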

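Results:
Through chaos engineering, we improved the fault tolerance of services on the main path, made monitoring more effective, and trained the relevant staff to locate and resolve problems under emergency conditions. The work also produced a chaos engineering tool, chaosblade, which serves the chaos engineering community. With the community's help it will cover more chaos experiment scenarios and help advance the field as a whole.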

Audience takeaways

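1. Understand what chaos engineering is
2. Understand the value of chaos engineering for distributed services
3. Learn how to roll out chaos engineering in an enterprise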

Talk coming soon
Mikolaj Pawlikowski Bloomberg Software Engineer Project Lead at Bloomberg LP
Track: Chaos Engineering

Session overview

Coming soon

Audience takeaways

Coming soon

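Track producer bio: Responsible for R&D and management of Meitu's overall backend infrastructure, platform services, and related innovative technologies, as well as Meitu's blockchain technology as a whole. Previously led the architecture platform team at Sina Weibo, driving the development and rollout of the Weibo platform's infrastructure and some of its core services. More than ten years of experience in the architecture design and development of large-scale, high-concurrency internet systems serving over a hundred million users.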

Copyright © 2008-2019 Msup & 高可用架构