  1. Dev - Nexus Repo
  2. NEXUS-29084

Data lost if "change repository blob store" task stops before it finishes

    Details

    • Story Points:
      2
    • Sprint:
      NXRM Immortals Sprint 42, NXRM Immortals Sprint 43, NXRM Immortals Sprint 44, NXRM Immortals Sprint 45
    • Notability:
      2

      Description

      Problem

      The Admin - Change repository blob store task may be stopped before it has finished moving all blobs from source blobstore A to target blobstore B. Reasons for stopping include:

      • unexpected task error
      • Nexus Repo instance shutdown

      The very first thing this task does when it starts running is change the repository's blobstore from A to B. So when the task stops, the repository is configured to believe all of its blobs are in blobstore B, when in fact some are still in blobstore A.

      In this partially moved state, builds and API requests against the repository can fail, because an inbound request causes the repository to look only in blobstore B for the matching blob. The left-behind blobs are effectively lost.
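
      The failure mode can be illustrated with a toy in-memory model. Everything here (the `Repository` class, `change_blob_store`, the `stop_after` parameter) is hypothetical and is not Nexus Repo code; it only mirrors the ordering described above, where the repository is repointed before any blob is moved.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    """Toy stand-in for a repository; not a Nexus Repo class."""
    blob_store: dict   # the blobstore the repository currently reads from
    blob_refs: list    # blob IDs the database believes the repository owns

    def fetch(self, blob_id):
        # Reads consult only the configured blobstore.
        return self.blob_store.get(blob_id)


def change_blob_store(repo, source, target, stop_after=None):
    """Mimic the ordering in this ticket: repoint first, then move blobs.

    stop_after is a hypothetical knob that simulates the task stopping
    (task error or instance shutdown) after that many blobs were moved.
    """
    repo.blob_store = target  # step 1: the repository now points at B
    for moved, blob_id in enumerate(list(source)):
        if stop_after is not None and moved >= stop_after:
            return  # task stopped; remaining blobs stay in A
        target[blob_id] = source.pop(blob_id)


store_a = {"blob-1": b"one", "blob-2": b"two", "blob-3": b"three"}
store_b = {}
repo = Repository(blob_store=store_a, blob_refs=list(store_a))

change_blob_store(repo, store_a, store_b, stop_after=1)

# Blobs the repository references but can no longer serve.
unreachable = [b for b in repo.blob_refs if repo.fetch(b) is None]
print(unreachable)
```

      In this model the two unreachable blobs still exist in blobstore A; they are only invisible to the repository.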

      Furthermore, blob references in the database that cannot be matched to actual blobs in a blobstore may eventually be pruned from the database by other normal operations and tasks intended to restore the database to a healthy state (for example, the reconcile task). For this reason, the partially moved state is dangerous to customer data.
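
      Before any cleanup task runs, a read-only audit can show which referenced blobs are stranded. The sketch below assumes that the blob IDs referenced by the database and the IDs present in each blobstore can be listed; how to obtain those listings from a real instance is not covered here, and `classify_blob_refs` is a hypothetical helper.

```python
def classify_blob_refs(referenced, in_source, in_target):
    """Split referenced blob IDs by where the blob actually lives.

    All three arguments are iterables of blob IDs (hypothetical inputs).
    Nothing is modified; this only reports.
    """
    in_source, in_target = set(in_source), set(in_target)
    report = {"moved": [], "stranded_in_source": [], "missing": []}
    for blob_id in sorted(set(referenced)):
        if blob_id in in_target:
            report["moved"].append(blob_id)
        elif blob_id in in_source:
            # Still recoverable, but invisible to the repository.
            report["stranded_in_source"].append(blob_id)
        else:
            report["missing"].append(blob_id)
    return report


report = classify_blob_refs(
    referenced=["blob-1", "blob-2", "blob-3", "blob-4"],
    in_source=["blob-2", "blob-3"],
    in_target=["blob-1"],
)
print(report)
```

      Entries under `stranded_in_source` are the ones at risk: the blobs still exist, but their database references could be pruned.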

      Existing recovery procedures known to be used in this situation are fragile. For example:

      Recovery Option 1: Use Admin - Change repository blob store task in the opposite direction

      One could try to run the task in the opposite direction to move the blobs now in B back to A. This risks encountering the same problem should the task stop again. It may also be impractical, depending on why the task stopped in the first place (for example, a task error) or why the task was run in the first place (for example, the source blobstore was out of disk space).

      Recovery Option 2: Combination of manually executing database queries, moving blobs out of band, and running rebuild tasks

      This option is a one-off solution and as such has its own risks. It can involve many individually risky steps.

      Expected

      There should be a more resilient process for changing a repository's blobstore and moving its blobs. The task cannot be prevented from stopping in every case, so the process should expect and plan for this error state. Recovery steps need to be simple and effective, avoid the risk of data loss, and allow fast recovery.
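
      One way to meet these expectations, offered only as an illustrative sketch and not as the design chosen for this ticket, is to copy blobs first, make the copy step idempotent so a rerun resumes where it stopped, and repoint the repository only after every blob is verified in the target. The dict-based stores, `move_repository_blobs`, and `fail_after` are hypothetical.

```python
class TaskStopped(Exception):
    """Simulates a task error or instance shutdown mid-run."""


def move_repository_blobs(config, repo_name, source, target, fail_after=None):
    """Copy, verify, repoint, then clean up. Safe to rerun after a stop.

    config maps repository name -> the blobstore it reads from.
    fail_after (hypothetical) stops the run after that many copies.
    """
    copied = 0
    for blob_id, data in list(source.items()):
        if target.get(blob_id) == data:
            continue  # already copied by an earlier run
        if fail_after is not None and copied >= fail_after:
            raise TaskStopped(blob_id)
        target[blob_id] = data  # copy, leaving the source blob in place
        copied += 1

    # Repoint only once every source blob is present in the target.
    if any(target.get(b) != d for b, d in source.items()):
        raise RuntimeError("verification failed; repository left on source")
    config[repo_name] = target

    # Delete source copies last; the target already holds every blob.
    source.clear()


store_a = {"blob-1": b"one", "blob-2": b"two", "blob-3": b"three"}
store_b = {}
config = {"releases": store_a}

try:
    move_repository_blobs(config, "releases", store_a, store_b, fail_after=1)
except TaskStopped:
    pass

# After the interrupted run the repository still reads from A, intact.
still_on_source = config["releases"] is store_a and len(store_a) == 3

move_repository_blobs(config, "releases", store_a, store_b)  # rerun resumes
print(still_on_source, config["releases"] is store_b, sorted(store_b))
```

      With this ordering, recovery from a stopped run is simply running the task again; at no point does the repository point at a blobstore that lacks some of its blobs.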

              People

              Assignee:
              Unassigned
              Reporter:
              Rich Seddon (rseddon)
              Last Updated By:
              Alexandre Santos
              Owner:
              Michael Kearns
              Votes:
              1
              Watchers:
              14

