docs/TechRef/PureSQLSearch.adoc

   1 == Pure-SQL searching
   2
   3 === Background
   4 Evergreen stores all data, including that useful for patron and staff search,
   5 in a normalized schema that is time and space efficient for transactional use
   6 cases, and provides guarantees on data integrity.  In addition, development is
   7 made simpler than would be the case otherwise and arbitrary reporting is made
   8 possible.
   9
  10 However, this structure is not effective for direct, SQL-only search
  11 functionality in a hierarchical, consortial dataset.  This is a problem that
  12 is relatively unique to Evergreen, as it is most often employed to host and
  13 serve large consortia with overlapping bibliographic datasets and
  14 non-overlapping item and location datasets.  Other search engines, including
  15 those built into other ILSs, do not generally have to account for
  16 hierarchically organized location visibility concerns as a primary use case.
  17 In other words, because it provides functionality that requires a hierarchical
  18 view of non-bibliographic data, a problem space for Evergreen is essentially
  19 nonexistent in competing products.
  20
  21 Evergreen's search infrastructure has evolved over the years.  In its current
  22 form, the software first performs a full text search against extracted
  23 bibliographic data and limits this initial internal result set to a
  24 configurable size.  It then investigates the visibility of each result on
  25 several non-bibliographic axes.  These visibility tests take up the
  26 preponderance of CPU time spent in search, with full text search of the
  27 bibliographic data generally completing within milliseconds. The main reason
  28 this multi-stage mechanism is used is that there are many visibility axes and
  29 attempting to join all required data sources together in a single query will
  30 cause the search use case to perform very poorly.  A previous attempt to
  31 create a pure SQL search mechanism failed for this reason.
  32
  33 A significant drawback of the current approach is that the costs imposed by
  34 visibility filtering search results using normalized non-bibliographic data,
  35 either in-query or separated from the main full-text query as it is today,
  36 make it necessary to place limits on the number of database rows matched by
  37 full-text query constructs.  This in turn can cause searches to omit results
  38 in certain situations, such as a large consortium consisting of a few large
  39 libraries and many small libraries.
  40
  41 However, it has been shown possible to overcome this performance issue by
  42 providing an extensible way to collect all visibility related information
  43 together into a small number of novel data structures with a compact in-memory
  44 representation and very fast comparison functions.  In this way, we are able
  45 to use pure SQL search strategies and therefore avoid result visibility
  46 problems while also benefiting from improvements to the core PostgreSQL
  47 database engine.  Further, this will open the door to indexing improvements,
  48 such as removal of the need for duplicate data storage, or the use of non-XML
  49 data storage schemes, which could reduce resource requirements and have a
  50 direct, positive effect on patron and staff search experience.
  51
  52 === Overview of existing search logic
  53
  54 . Construct core bibliographic search query
  55 . Collect non-bibliographic filtering criteria
  56 . Pass query and filters to a database function
  57 . Calculate hierarchical location information for visibility testing
  58 . Open cursor over core query, limited to *superpage_size * max_superpages* records
  59 . Core query implements bib-level sorting
  60 . For each result
  61 .. NEXT if not on requested superpage
  62 .. Check deleted flag, based on search type
  63 .. Check transcendence
  64 ... Return result if true
  65 .. Check for direct Located URI in scope
  66 ... Return result if exists
  67 .. Check copy status + (circ lib | owning lib) based on modifier
  68 .. Check peer bib copy status + (circ lib | owning lib) based on modifier
  69 .. Check copy location based on filter
  70 .. Check peer bib copy location based on filter
  71 .. General copy visibility checks
  72 ... If NOT staff
  73 .... Check for OPAC visible copies (trigger-maintained materialization)
  74 .... Check for peer bib OPAC visible copies
  75 ... If staff
  76 .... Confirm no copies here
  77 .... Confirm no peer bib map
  78 .... Confirm no copies anywhere
  79 .... Confirm no Located URIs elsewhere
  80 .. Return result if not excluded
  81 . Calculate summary row
  82
  83 === Overview of new mechanism
  84 Record and copy information (everything checked in *(7)* above) is collected
  85 into a novel data structure that allows all visibility-indicating criteria to
  86 be flattened to integer arrays.  This is facilitated by a database trigger in
  87 much the same way that basic OPAC copy visibility is collected for copies
  88 today.
  89
  90 Most identifiers in Evergreen are stored as signed integers of either 32 or 64
  91 bits.  The smaller 32 bit space allows for approximately two billion positive
  92 entries, but all identifiers for table rows that are used as visibility axes
  93 fall into a range of between one and one million for all applicable use cases,
  94 and all identifiers of interest are positive.  Therefore, we can make use of
  95 the most significant bits in an integer value to create a per-axis namespacing
  96 mask.  When applied to the identifier for a visibility axis identifier, this
  97 mask allows two values that are identical across axis to be identified as
  98 unique within a combined set of all values.
  99
 100 Specifically, we retain the four most significant bits of the integer space
 101 and create from that 16 potential bitmasks for per-axis segregation of
 102 identifiers.  Further, we separate copy-centered axes and bibliographic
 103 record-centered attributes into two separate columns for storage purposes,
 104 which means we can use the same four bits for different purposes within each
 105 copy or bib set.
 106
 107 In order to implement existing visibility tests with this infrastructure, six
 108 copy axes and two record axes are used from the possible 16 from each set.
 109 See the search.calculate_visibility_attribute() for details.  By using 32 bit
 110 integers we can collect all of the bitmasked values of each type (copy or bib)
 111 into a single integer array and leverage the Postgres intarray extension to
 112 test all axes at once.
 113
 114 At search time, required and user-requested visibility restrictions are
 115 converted to *query_int* values. Results are directly filtered based on these
 116 calculated *query_int* values.  This works in a way analogous to record
 117 attribute filtering, avoiding the need to test statuses, circ and owning
 118 library visibility, copy locations and location groups, copy OPAC visibility,
 119 peer bibliographic record, Located URIs, or bibliographic record sources
 120 directly.
 121
 122 === Minimum Postgres version requirement
 123 Due to features, particularly functions, available only in 9.4 and newer that
 124 are key to the performance of the new method, Postgres 9.4 will need to be the
 125 new lowest supported version for use with Evergreen.  While some of the new
 126 features and functions could be implemented as user-defined functions in
 127 PL/PGSQL, they would not be fast enough to make this pure-SQL search viable.
 128
 129 Among the important improvements that Postgres 9.4 and newer versions bring to
 130 Evergreen are:
 131
 132 * Version 9.4 improved GIN indexes in ways that directly benefit Evergreen, as well as how anti-joins are planned which matters for some Evergreen searches.
 133 * Version 9.5 introduced many general performance improvements, especially for joins and sorting, and brought planner improvements that impact complex queries such as those generated by this code.
 134 * Version 9.6 delivered more general performance improvements, particularly for large servers such as those that Evergreen databases tend to live on, as well as more improvements to GIN indexes, executor changes that can avoid unnecessary work in search queries, new built-in full-text phrase searching, and initial parallel query execution.
 135
 136 === Performance
 137 The cost of the non-bibliographic filter value caching maintenance process is
 138 10-40% faster than existing partial caching logic which it would replace.
 139
 140 The new code achieves up to 10% faster search times than the old, suboptimal
 141 mechanism time for broad searches.  The new code is faster for more selective
 142 searches, often by up to 90% faster.  In both broad and narrow search cases
 143 the new mechanism performs with complete accuracy and does not miss
 144 small-collection hits in large consortia as the existing code does.
 145
 146 Unsurprisingly, and in addition to the above improvements, performance is
 147 improved marginally as each successive Postgres version at and beyond 9.4.
 148
 149 === Page rendering changes
 150 Previously, Evergreen would request the record details for a user-visible page
 151 of results in parallel, and then, serially, request the facet data for the
 152 result set.  Now, the facet data is requested asynchronously in the background
 153 and then a single feed containing all records on a result page is requested
 154 synchronously.  By parallelizing the result and facet metadata, page rendering
 155 time is cut down significantly.  Concurrent requests of the same bibliographic
 156 record are shared between apache backends to reduce result request time, and by
 157 making one request instead of ten simultaneously, database load is reduced.  A
 158 performance improvement of up to 20% in post-search page rendering time is seen
 159 from this change.
 160
 161 Additionally, cross-apache caching of ancillary data, such as the coded value
 162 map and other data, via memcache significantly reduces the average page
 163 rendering time not just for result pages, but most pages generated by
 164 Evergreen.  An additional performance improvement of up to 50% in post-search
 165 page rendering time is seen from this change.
 166
 167 While these changes are not directly related to the removal staged search, they
 168 touch areas impacted by core search changes and provided enough improvement
 169 that implementing them concurrently with the elimination of staged search
 170 seemed optimal.
 171
 172 === User visible configuration changes
 173 The stock configuration now provides an increased value for *max_superpages*
 174 in opensrf.xml.  The default is now 100, and the *superpage_size* remains
 175 1000, for a total limit of 100,000 hits per search.  This is not a limit on
 176 visibility per se, as all records are visibility tested and ranked before
 177 limiting, but simply a limit on the number of pages a user could click through
 178 before reaching the end of the presented result list.
 179
 180 === Tuning sensitivity
 181 User-level timeouts are still possible with both the old and new code, given a
 182 large enough dataset, a broad enough query, and a cold cache.  However, the
 183 *gin_fuzzy_search_limit* GUC can be used to set a time cap on the new
 184 mechanism. See https://www.postgresql.org/docs/9.6/static/gin-tips.html for
 185 background, though the suggested values in the documentation are significantly
 186 lower than would be readily useful for a large Evergreen instance.
 187
 188 Because it uses a more complex query structure, the new mechanism is somewhat
 189 more sensitive to Postgres tuning in general.  In particular, lowering
 190 *random_page_cost* from the default of *4.0* to a more reasonable *2.0* is
 191 important for proper query planning.  For Evergreen use cases where the search
 192 indexes and relevant tables are kept in RAM or SSDs are used for storage, this
 193 value is acceptable and useful in general.
 194
 195 === Funding and development
 196 This project was funded by MassLNC and developed by Equinox Open Library
 197 Initiative.