  1. Dev - Nexus Repo
  2. NEXUS-29421

Docker GC task may delete many layers if a network or file system issue occurs at the scheduled time

    Details

    • Story Points:
      5
    • Sprint:
      NXRM MadMax Sprint 27, NXRM MadMax Sprint 28, NXRM MadMax Sprint 29, NXRM MadMax Sprint 30
    • Notability:
      3
    • InvestmentLayer:
      support-escalated
    • Aha Concept:
      non-concept

      Description

      SYMPTOM:

      The daily scheduled "Docker - Delete unused manifests and images" (docker GC) task deleted 18,000 assets from the docker repositories. Both repository.docker.gc-20211010070000011.log and nexus.log show many "Attempt to access non-existent blob" messages, which indicates that there may have been a file system issue while this task was running.

      REPRODUCE STEPS:

      1. Create a new file blobstore "test" (in a real scenario, this would be S3 or NFSv4)
      2. Create a new docker hosted repository "docker-test-hosted" that uses the above blobstore
      3. Upload an image, e.g. alpine:3.13, and make sure it can be pulled
      4. Wait for at least one hour (minimum offset hour)
      5. Rename ..../blobs/test/content to blobs/test/content_orig (simulating a brief network mount issue)
      6. Run the docker GC task against "docker-test-hosted"
      7. Rename blobs/test/content_orig to blobs/test/content
      8. Try pulling alpine:3.13 from docker-test-hosted from a different PC/client

      The pull request returned 404:

      172.17.0.1 - admin [01/Nov/2021:01:41:07 +0000] "GET /repository/docker-test-hosted/v2/alpine/blobs/sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec HTTP/1.1" 404 - 156 14 "containers/5.15.2 (github.com/containers/image)" [qtp47466571-798]
      

      Because it was already deleted by the task:

      {"timestamp":"2021-11-01 01:37:29,972+0000","nodeId":"1A0F6665-10338CD1-99AF1787-30A58F56-8DC69649","initiator":"*TASK","domain":"repository.asset","type":"deleted","context":"v2/-/blobs/sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec","thread":"quartz-17-thread-20","attributes":{"repository.name":"docker-test-hosted","format":"docker","name":"v2/-/blobs/sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec"}}
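      To gauge how many assets a task run removed, audit events like the one above can be counted per repository. The following is a minimal sketch, assuming the audit log contains one JSON object per line with the same fields as the sample event; the function name count_task_deletions is hypothetical and not part of Nexus.

```python
import json
from collections import Counter


def count_task_deletions(lines):
    """Count asset deletions initiated by a task, grouped by repository name.

    Assumes each line is a JSON audit event shaped like the sample in this issue.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON audit events
        if not isinstance(event, dict):
            continue
        if (
            event.get("domain") == "repository.asset"
            and event.get("type") == "deleted"
            and event.get("initiator") == "*TASK"
        ):
            attributes = event.get("attributes") or {}
            counts[attributes.get("repository.name", "unknown")] += 1
    return counts
```

      It can be applied to an open audit log file, for example count_task_deletions(open(path)), where path points to the audit log of the affected node. A sudden spike for a single repository would be consistent with the mass deletion described here.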
      

      EXPECTED BEHAVIOUR:

      Similar to the Deploy Offset hour, the docker GC task should have additional logic to prevent accidental mass deletion. One idea is to introduce a deletion ratio: if X% or more of the assets are going to be deleted, the task should fail. Another idea is to check first whether the content directory is readable.
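      The two ideas above could be combined into a single pre-deletion check. This is only an illustrative sketch, not the actual Nexus implementation: the names MassDeletionError, content_dir_is_readable and check_deletion_allowed, and the default threshold of 20%, are all assumptions made for the example.

```python
import os


class MassDeletionError(RuntimeError):
    """Raised when a GC run looks unsafe (hypothetical error type)."""


def content_dir_is_readable(content_dir):
    """Return True if the blob store content directory exists and can be listed."""
    return os.path.isdir(content_dir) and os.access(content_dir, os.R_OK | os.X_OK)


def check_deletion_allowed(total_assets, deletion_candidates, content_dir, max_ratio=0.2):
    """Fail before deleting anything if the run looks like an accidental mass deletion.

    max_ratio is an assumed threshold standing in for the "X %" in this issue.
    """
    # Idea 2: refuse to run when the content directory is missing or unreadable,
    # for example during a short network mount outage.
    if not content_dir_is_readable(content_dir):
        raise MassDeletionError(f"content directory is not readable: {content_dir}")

    if total_assets == 0:
        return  # nothing to delete, nothing to guard

    # Idea 1: refuse to run when too large a share of assets would be deleted.
    ratio = deletion_candidates / total_assets
    if ratio > max_ratio:
        raise MassDeletionError(
            f"{deletion_candidates} of {total_assets} assets ({ratio:.0%}) "
            f"would be deleted, exceeding the limit of {max_ratio:.0%}"
        )
```

      In the reproduce steps above, the renamed content directory would make the readability check fail at step 6, so the task would stop before removing any asset.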

              People

              Assignee:
              vgrab Vladimir Grab
              Reporter:
              hosako Hajime Osako
              Last Updated By:
              Michael Oliverio
              Team:
              NXRM - Mad Max
              Votes:
              1
              Watchers:
              5
