How to resolve a “Sourcegraph.com is deleted entirely” incident

Assess what, exactly, has been deleted

  • Navigate to the sourcegraph-dev project and look at the existing Kubernetes clusters. Does the cloud cluster still exist? (A CLI sketch for this check follows this list.)
    • No, the cloud cluster is gone:
      • Do the disks for the now-deleted cluster nodes still exist? Check by navigating to Compute Engine -> Disks and searching for the disks by name.
        • Yes, the disks still exist: go to Recreating GKE cluster and follow the “with existing disks” steps.
        • No, the disks are gone: go to Recreating GKE cluster and follow the “from snapshots” steps.
    • Yes, the cloud cluster exists: go to Recreating Kubernetes objects.
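
A quick way to make the same check from the CLI, as a minimal sketch (the cluster name cloud and the disk name filter are assumptions; adjust them to match what the console shows):

    # Check whether the cluster and its node disks still exist.
    gcloud config set project sourcegraph-dev
    gcloud container clusters list                     # is the cloud cluster still listed?
    gcloud compute disks list --filter="name~'cloud'"  # name filter is an assumption; adjust as needed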

Recreating GKE cluster

We use Terraform to manage our infrastructure.

  1. Navigate to the cloud repo.
  2. Follow the instructions in that repo to run terraform plan and see whether the infrastructure has drifted from what is specified in code.
  3. Run terraform apply to reconcile the infrastructure with its definition in code (a minimal sketch of this workflow follows this list).
  4. With existing disks: go to Recreating Kubernetes objects.
  5. From snapshots: go to Restore disks from snapshots.
  6. Then go to Confirm health of Sourcegraph.com.
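
The Terraform workflow is roughly the following. This is a minimal sketch and the working directory is an assumption; follow the cloud repo's own instructions for the exact path and any required variables:

    # Minimal sketch; the directory name is illustrative, see the cloud repo's docs.
    cd cloud/         # cd into the Terraform root that defines the GKE cluster
    terraform init    # set up providers and the remote state backend
    terraform plan    # inspect how the real infrastructure has drifted from the code
    terraform apply   # reconcile the infrastructure back to its definition in code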

Recreating Kubernetes objects

  1. Navigate to the cloud cluster on the Google Cloud console, click Connect, and run the `gcloud` command it gives you.
  2. `kubectl -n prod get deployments` should show partial or no deployments, but it confirms that you are connected to the right cluster.
  3. From the latest release branch of the https://github.com/sourcegraph/deploy-sourcegraph-cloud repository, run kubectl-apply-all.sh, which will recreate all Kubernetes objects (a condensed sketch follows this list).
  • Sourcegraph.com uses static disk attachments, so the volumes should still be valid and no data should have been lost.
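
A condensed version of the above, as a minimal sketch; the zone is an assumption, so copy the exact get-credentials command from the console's Connect dialog:

    # Minimal sketch; the zone is an assumption, use the exact command from the Connect dialog.
    gcloud container clusters get-credentials cloud --zone us-central1-f --project sourcegraph-dev
    kubectl -n prod get deployments    # expect partial or no deployments
    # From the latest release branch of deploy-sourcegraph-cloud:
    ./kubectl-apply-all.sh             # recreates all Kubernetes objects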

Go to Confirm health of Sourcegraph.com

Restore disks from snapshots

  1. We use Velero to manage our disaster recovery process.

  2. Navigate to the cloud cluster on the Google Cloud console, click Connect, and run the `gcloud` command it gives you.

  3. Ensure you have Velero installed locally (`brew install velero`).

  4. Check whether the velero namespace exists: `kubectl get ns velero`

  5. If it does not, you need to install and configure Velero.

    # Create a key for the existing Velero service account, then install Velero
    # into the cluster, pointing it at the backup bucket.
    gcloud config set project sourcegraph-dev

    SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list \
      --filter="displayName:Velero service account" \
      --format 'value(email)')

    gcloud iam service-accounts keys create credentials-velero \
      --iam-account $SERVICE_ACCOUNT_EMAIL

    velero install \
      --provider gcp \
      --plugins velero/velero-plugin-for-gcp:v1.4.0 \
      --bucket sg-velero-cloud-backup \
      --secret-file ./credentials-velero
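
    Before moving on, you can sanity-check the install with standard Velero commands (this is just a quick check, not part of the official steps):

    kubectl -n velero get pods    # the velero pod should reach Running
    velero backup-location get    # the default backup location should show Available
    velero backup get             # backups stored in sg-velero-cloud-backup should be listed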
    
  6. Follow the steps in the Velero restore documentation.

    a. First, patch the backup storage location to be read-only:

    kubectl patch backupstoragelocation default \
      --namespace velero \
      --type merge \
      --patch '{"spec":{"accessMode":"ReadOnly"}}'
    

    b. Find the most recent backup with velero backup get, then run velero restore create --from-backup <BACKUPNAME> (commands to watch the restore's progress are listed after this step).

    c. Finally, revert the accessMode to ReadWrite:

    kubectl patch backupstoragelocation default \
      --namespace velero \
      --type merge \
      --patch '{"spec":{"accessMode":"ReadWrite"}}'
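
    To watch the restore's progress, standard Velero commands can be used; the restore name below is a placeholder, take it from the output of velero restore get:

    velero restore get                       # list restores and their current status
    velero restore describe <RESTORE-NAME>   # detailed progress, warnings, and errors
    velero restore logs <RESTORE-NAME>       # full logs once the restore has completed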
    
  7. Go to Confirm health of Sourcegraph.com.

Confirm health of Sourcegraph.com

Follow the regular, documented incident follow-up procedures.