Affects Version/s: 3.34.1
Fix Version/s: None
Sprint: NXRM Immortals Sprint 42, NXRM Immortals Sprint 43, NXRM Immortals Sprint 44, NXRM Immortals Sprint 45
The Admin - Change repository blob store task may be stopped before it has a chance to finish moving all blobs from source blobstore A to target blobstore B. The reason for stopping can include:
- unexpected task error
- Nexus Repo instance shutdown
The very first thing this task does when it starts running is change the repository blobstore from A to B. So when the task stops, the repository is configured to believe all of its blobs are to be found in blobstore B, when in fact some are still left in blobstore A.
In this partially moved state, builds and API requests against the repository can fail, because an inbound request causes the repository to look only inside blobstore B for the matching blob. The left-behind blobs are essentially lost.
Furthermore, blob references in the database that cannot be matched to actual blobs in a blobstore may eventually be pruned from the database by other normal operations and tasks intended to put the database into a healthy state (e.g. the reconcile task). For this reason, the partially moved blob state is dangerous to customer data.
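The failure mode described above can be sketched with a toy model. This is only an illustration and not Nexus Repository code: the `Repository` class, the `change_blobstore` function, and the `stop_after` parameter are hypothetical names, and blobstores are modelled as plain dictionaries.

```python
class Repository:
    """Toy repository that reads blobs only from its configured blobstore."""

    def __init__(self, blobstore):
        self.blobstore = blobstore  # dict of blob_id -> bytes (assumption)

    def get(self, blob_id):
        # Only the configured blobstore is consulted, as described above.
        return self.blobstore.get(blob_id)


def change_blobstore(repo, source, target, stop_after=None):
    """Hypothetical model of the task: switch config first, then move blobs.

    stop_after simulates the task being stopped after moving that many blobs.
    """
    repo.blobstore = target  # configuration is switched before any blob moves
    for moved, blob_id in enumerate(sorted(source)):
        if stop_after is not None and moved >= stop_after:
            return False  # task stopped before finishing
        target[blob_id] = source.pop(blob_id)
    return True


store_a = {"blob-1": b"one", "blob-2": b"two", "blob-3": b"three"}
store_b = {}
repo = Repository(store_a)

finished = change_blobstore(repo, store_a, store_b, stop_after=1)
missing = [b for b in ("blob-1", "blob-2", "blob-3") if repo.get(b) is None]
print(finished, missing)  # False ['blob-2', 'blob-3']
```

In this model the two blobs still sitting in blobstore A are intact on disk, but the repository can no longer reach them because it only consults blobstore B.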
Existing recovery procedures known to be used in this situation are fragile. For example:
Recovery Option 1: Run the task in the opposite direction to move the blobs now in B back to A
This risks encountering the same problem should the task stop again. It may also be impractical, depending on why the task stopped in the first place (for example, a task error) or why the task was run in the first place (for example, the source blobstore was out of disk space).
Recovery Option 2: A combination of manually executed database queries, moving blobs out-of-band, and running rebuild tasks
This option is a one-off solution and as such carries its own risks. It can involve many individual, risky steps.
There should be a more resilient process for changing a repository blobstore and moving its blobs. The task cannot always be prevented from stopping, so the process should expect and plan for this error state. The recovery steps need to be simple and effective, avoid the risk of data loss, and allow for fast recovery.
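One possible shape for such a process is sketched below, purely as an illustration and not as a proposed Nexus Repository design. It assumes the repository records that a move is in progress, falls back to the source blobstore for reads until the move completes, and can simply be re-run after an interruption. All names (`Repository`, `migrating_from`, `move_blobs`, `stop_after`) are hypothetical, and blobstores are again modelled as dictionaries.

```python
class Repository:
    """Toy repository that remembers an in-progress blobstore move."""

    def __init__(self, blobstore):
        self.blobstore = blobstore   # active blobstore, dict blob_id -> bytes
        self.migrating_from = None   # source blobstore while a move is running

    def get(self, blob_id):
        blob = self.blobstore.get(blob_id)
        if blob is None and self.migrating_from is not None:
            # Fall back to the source while the move is incomplete.
            blob = self.migrating_from.get(blob_id)
        return blob


def move_blobs(repo, target, stop_after=None):
    """Move blobs to target; safe to call again after an interruption.

    stop_after simulates the task being stopped after moving that many blobs.
    """
    if repo.migrating_from is None:
        if target is repo.blobstore:
            raise ValueError("target is already the active blobstore")
        repo.migrating_from = repo.blobstore  # record the source before moving
        repo.blobstore = target
    source, target = repo.migrating_from, repo.blobstore
    for moved, blob_id in enumerate(sorted(source)):
        if stop_after is not None and moved >= stop_after:
            return False  # stopped early; in-progress state is kept
        target[blob_id] = source[blob_id]  # copy first
        del source[blob_id]                # then remove from the source
    repo.migrating_from = None  # move complete, fallback no longer needed
    return True


store_a = {"blob-1": b"one", "blob-2": b"two", "blob-3": b"three"}
store_b = {}
repo = Repository(store_a)

first = move_blobs(repo, store_b, stop_after=1)   # interrupted run
readable = all(repo.get(b) is not None for b in ("blob-1", "blob-2", "blob-3"))
second = move_blobs(repo, store_b)                # re-run resumes the move
print(first, readable, second, len(store_a))  # False True True 0
```

In this sketch, recovery from an interrupted run is just running the same move again, and no blob becomes unreachable in the meantime.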