Incident response use case

This page provides a strategic introduction to one of our company use cases. See the list of company use cases to find the rest and to learn how we use them as part of our company strategy.

Sponsors

This use case has sponsors who help maintain it. If you have questions or suggestions, you can reach out to them.

Overall vision

Sourcegraph provides a collaborative platform that helps devs understand why a problem is occurring and how it may affect other services. Both are crucial for resolving incidents caused by bad code changes in distributed systems. Sourcegraph helps assure incident responders that all holes are plugged, using insights and monitoring to provide confidence that the incident is resolved. While other tools deal only with runtime, not code, Sourcegraph helps devs identify the root cause in code and fix the issue everywhere so it won’t recur.

Why this is important

  • In the DevOps model, which is widely accepted in the industry, devs are expected to be on call and respond to incidents affecting the services they build (instead of handing off those responsibilities to a separate non-dev “Ops” team, as was done pre-DevOps). They are responsible for mission-critical production services.
  • Incident response often relies on heroic efforts by individual developers rallying around an incident. This is neither scalable nor sustainable: team members can be unreachable at times, and no one team member should be a single point of failure, as often happens.
  • Developers lack a single pane of glass that gives them clear, universal visibility in a stressful situation where every minute matters.
  • To be successful, development teams need to stop the impact of the incident and also analyze the root cause to ensure it does not happen again. This is a complex feat that demands time and effort.

Sourcegraph is essential to how the Cloudflare security team addresses security risks and root-causes incidents. David Haynes, a Security Engineer at Cloudflare, says, “When a potential security issue comes up, I often have to go into another engineer’s project to quickly understand how the code works to understand the critical functions, where the data is flowing, what sort of controls or checks are happening. With Sourcegraph, I can jump into another engineer’s project and quickly explore and better understand the code faster.” Read more in the Cloudflare case study.

How we solve this today

  • Resolving an incident is a collaborative effort that must be documented. Sourcegraph aids in these efforts by allowing users to share links to searches and files and by recording work done in a search notebook.
  • Incident response (IR) efforts often require reports. These can include lists of impacted files, repositories, or other artifacts that Sourcegraph helps generate.

Who benefits

Developer:

  • Get the incident fixed faster

Engineering Leader:

  • Meet the organization’s SLOs, avoid negative business impact from prolonged downtime, resolve incidents faster, and reduce escalations.

Features that enable this use case

Additional resources