Scenario
Users can run into issues such as network timeouts and OpenAI errors when embedding large repositories, because the embedding jobs for these repositories run for a long time (sometimes more than a day, depending on repository size). This article also covers how to safely migrate existing embeddings from the default blobstore to cloud object storage.
Error messages encountered
Error embedding repository: error while getting embeddings: embeddings: POST "https://cody-gateway.sourcegraph.com/v1/embeddings": failed with status 500: {"error":"embeddings: POST \"https://api.openai.com/v1/embeddings\": failed with status 500: {\n \"error\": {\n \"message\": \"The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID a4328f0d6bb86af042f3bec1b5d0cb3c in your email.)\",\n \"type\": \"server_error\",\n \"param\": null,\n \"code\": null\n }\n}\n"}
Explanation
Long-running embedding jobs can fail because of infrastructure or network issues on the client side, or because of OpenAI outages. The error above, for example, is an HTTP 500 returned by the OpenAI API and passed back through Cody Gateway.
Migrating existing embeddings from the default storage to cloud object storage is a bit tricky, so the process is described in detail below.
Resolution
Sourcegraph 5.2 includes a fix that makes embedding jobs more tolerant of client-side and OpenAI failures, so that a single failed request no longer fails the entire embedding job. In addition, you can raise the SRC_HTTP_CLI_EXTERNAL_TIMEOUT environment variable to a higher value, such as 1 hour or more. This variable must be set on the worker service in the deployment configuration. One customer used this fix to successfully embed a 13 GB repository in production.
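As a minimal sketch, the check below assumes that SRC_HTTP_CLI_EXTERNAL_TIMEOUT takes a Go-style duration string such as `1h` or `90m` (an assumption; confirm the accepted format for your Sourcegraph version). The helpers `parse_duration` and `worker_env` are hypothetical: they only build and sanity-check the environment entry before you add it to the worker in your Kubernetes, Docker Compose, or other deployment configuration.

```python
import re

# Subset of Go-style duration units handled by this sketch: hours, minutes, seconds.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_DURATION_RE = re.compile(r"(\d+)([hms])")


def parse_duration(value: str) -> int:
    """Return the number of seconds in a duration such as "1h", "90m" or "1h30m"."""
    parts = _DURATION_RE.findall(value)
    # Reject strings that are not made up entirely of number+unit pairs, e.g. "1 hour".
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def worker_env(timeout: str, minimum_seconds: int = 3600) -> dict:
    """Build the worker env entry, refusing timeouts shorter than the minimum (1 hour by default)."""
    if parse_duration(timeout) < minimum_seconds:
        raise ValueError(f"{timeout!r} is shorter than {minimum_seconds} seconds")
    return {"SRC_HTTP_CLI_EXTERNAL_TIMEOUT": timeout}
```

For example, `worker_env("2h")` returns the single entry to add to the worker's environment, while `worker_env("30m")` raises because it is below the suggested 1 hour.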
Steps to safely migrate existing embeddings from internal storage to cloud object storage
1) Disable incremental embeddings by setting the following in the site configuration:
"embeddings": { "incremental": false }
Then delete the embeddings policy.
2) Cancel all processing and queued embedding jobs from the UI.
3) Enable the embeddings pod and configure the new S3 object storage location, following the Sourcegraph documentation on configuring external object storage.
4) Copy the existing indexes from the blobstore to the new S3 object storage.
5) Re-enable incremental embeddings ("incremental": true) and recreate the embeddings policy.
6) Monitor the embedding jobs. Repositories should start queueing up for new embeddings.
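Steps 1 and 5 flip the same site-configuration setting. The sketch below shows that toggle, assuming the site configuration has been exported as plain JSON with no comments (Python's `json.loads` rejects comments). The function `set_incremental_embeddings` is a hypothetical helper; it changes only `embeddings.incremental` and leaves every other key in place.

```python
import json


def set_incremental_embeddings(site_config_text: str, enabled: bool) -> str:
    """Return site-config JSON with embeddings.incremental set to `enabled`.

    Assumes plain JSON without comments. All other keys, including other
    "embeddings" settings, are preserved unchanged.
    """
    config = json.loads(site_config_text)
    # Create the "embeddings" object if it is missing, otherwise update it in place.
    embeddings = config.setdefault("embeddings", {})
    embeddings["incremental"] = enabled
    return json.dumps(config, indent=2)
```

Call it with `False` for step 1 and `True` for step 5, then paste the result into the site configuration editor. Deleting and recreating the embeddings policy is a separate step and is not covered here.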
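Step 4 is essentially "copy every object, then verify it". The sketch below illustrates that pattern against a hypothetical minimal `ObjectStore` interface (`list_keys`, `get`, `put`), with an in-memory stand-in for testing. It keeps object keys unchanged and re-reads each copied object to compare SHA-256 digests. It is not a Sourcegraph tool; in practice you would back the interface with an S3 client pointed at the blobstore on one side and at the new bucket on the other.

```python
import hashlib
from typing import Dict, Iterable, Optional, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal store interface; back it with your S3 client of choice."""

    def list_keys(self) -> Iterable[str]: ...
    def get(self, key: str) -> bytes: ...
    def put(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in store used to exercise the copy logic without a real bucket."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self.objects: Dict[str, bytes] = dict(objects or {})

    def list_keys(self) -> Iterable[str]:
        return sorted(self.objects)

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def copy_indexes(src: ObjectStore, dst: ObjectStore) -> int:
    """Copy every object from src to dst under the same key, verifying each copy.

    Returns the number of objects copied; raises RuntimeError on a checksum mismatch.
    """
    copied = 0
    for key in src.list_keys():
        data = src.get(key)
        dst.put(key, data)
        # Re-read from the destination and compare digests to catch a bad copy early.
        if hashlib.sha256(dst.get(key)).digest() != hashlib.sha256(data).digest():
            raise RuntimeError(f"checksum mismatch for {key}")
        copied += 1
    return copied
```

With boto3, `get_paginator("list_objects_v2")`, `get_object`, and `put_object` could implement the three methods. Because this sketch holds each object in memory, very large index files would call for a streaming or multipart copy instead.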