Uploaded image for project: 'Dev - Nexus Repo'
  1. Dev - Nexus Repo
  2. NEXUS-34608

cleanup policies may cause PostgreSQL database connection pool exhaustion and prevent restarts

    Details

      Description

      Summary

      Cleanup policies may exhaust the database connection pool and lead to product instability, requiring DBA intervention to terminate running queries in Postgresql before NXRM can be safely restarted.

      Problem

      A customer was running 3.38.1 with the default database connection pool size of 100 connections. NXRM was running in a pod inside Kubernetes.

      The instance was performing OK for many months, without any cleanup policies configured.

      Then a new cleanup policy was added to a Maven 2 "snapshots" version policy hosted repository. The criteria was only this one criteria:

      Component Usage: Remove components that haven't been downloaded in 30 days.

      The cleanup task that processes cleanup policies ran. A few hours later builds started to fail because GET requests to download components were not completing in a timely manner and user authentications were failing. other database operations requiring a database connection were not completing

      In an attempt to restore stability to the instance, the pod was killed where NXRM was running and multiple restarts attempted because startup would never complete. NXRM would begin to start but always stop logging anything at this log line during startup:

      INFO  [FelixStartLevel]  *SYSTEM org.sonatype.nexus.datastore.mybatis.MyBatisDataStore - nexus - Creating schema for Maven2ComponentDAO
      

      Diagnosis

      Examining the pg_activity table on PostgreSQL for active queries from NXRM during the problem showed the following:

      One query like this:

      DELETE FROM maven2_component WHERE component_id = ANY($1::int[])
      

      99 other queries doing this:

      UPDATE maven2_asset SET last_downloaded = CURRENT_TIMESTAMP, last_updated = CURRENT_TIMESTAMP
      

      9 of these queries initiated from each of 9 restart attempts of NXRM after the initial incident:

      CREATE INDEX IF NOT EXISTS idx_maven2_component_kind ON maven2_component (kind)
      

      Recall the default NXRM DB connection pool size is defaulted to 100.

      The nexus.log at time of incident starting reporting a variety of different queries failing because of:

      Error querying database.  Cause: java.sql.SQLTransientConnectionException: nexus - Connection is not available, request timed out after 30000ms.
      

      Not until the original NXRM database session/queries were terminated at the database level could NXRM be successfully restarted.

      Expected

      Cleanup policy delete statements need to perform in such a way that

      • there is little to no risk of exhausting the NXRM database connection pool
      • be short running and help PostgreSQL eventually terminate related queries on its own or better detect when NXRM has been terminated itself or terminated the task performing queries - although if PostgreSQL can do this depends
      • statements do not block for long period of times other update or table modification statements related to components or assets

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              Unassigned Unassigned
              Reporter:
              plynch Peter Lynch
              Last Updated By:
              Peter Lynch Peter Lynch
              Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Date of First Response:

                  tigCommentSecurity.panel-title