docs/TechRef/LinkChecker.txt

URL Verification
================

Evergreen can now verify URLs. It is hoped that this will be of particular
benefit to locations with large electronic resource collections.

Overview
--------

To support verification of URLs, Evergreen has gained several new
capabilities, along with extensions to some existing features.

A wizard-style interface walks a staff member through the process of collecting records and URLs to verify, verifying the URLs, and reviewing the results.

URL validation sessions are built as a whole to support immediate and
future review of any URLs.  Each session carries a name, an owner, a set
of record search criteria, a set of tag and subfield combinations describing
the location of URLs to validate, a record container for tracking individual
records to verify, and a set of state and data tables for managing the
processing of individual URLs.

A set of middle-layer methods provides the business logic required to collect
records and to extract, parse, and test the validity of their URLs.

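As a rough illustration of the extract and parse steps, the following Python sketch pulls URLs out of a MARCXML record for a given tag/subfield pair and splits each one into the components that Phase 3 of the workflow stores.  It is not Evergreen's implementation: Python is used only for illustration, and the function names, the sample record, the exact XPath form, and the rule that the domain is the last two labels of the host are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical sample record; real records would come from the session's container.
SAMPLE_RECORD = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">http://www.example.com/books/view.html?id=42#toc</subfield>
    <subfield code="y">Read online</subfield>
  </datafield>
</record>
"""


def selector_xpath(tag, subfield):
    """Build an ElementTree-compatible XPath for one tag/subfield pair."""
    return (f".//{MARC_NS}datafield[@tag='{tag}']"
            f"/{MARC_NS}subfield[@code='{subfield}']")


def extract_urls(marcxml, selectors):
    """Yield (tag, subfield, url) for every non-empty match in the record."""
    root = ET.fromstring(marcxml)
    for tag, subfield in selectors:
        for node in root.findall(selector_xpath(tag, subfield)):
            if node.text and node.text.strip():
                yield tag, subfield, node.text.strip()


def split_url(url):
    """Break a URL into scheme, host, domain, TLD, path, page, query, fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        "full_url": url,
        "scheme": parts.scheme,
        "host": host,
        # Naive assumption: domain = last two labels, TLD = last label.
        # Suffixes such as "co.uk" would need a public-suffix list.
        "domain": ".".join(labels[-2:]),
        "tld": labels[-1] if labels else "",
        "path": parts.path,
        "page": parts.path.rsplit("/", 1)[-1],  # last component of the path
        "query": parts.query,
        "fragment": parts.fragment,
    }


found = list(extract_urls(SAMPLE_RECORD, [("856", "u")]))
components = split_url(found[0][2])
```

Here `found` holds one entry for the 856 $u subfield, and `components` separates the host (`www.example.com`) from the domain (`example.com`) and the page (`view.html`), which is the distinction the later filtering phases rely on.
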
Workflow
--------

URL verification and update are performed as a series of coordinated phases.

    * Phase 1 - Select or Create a session
        ** Collect the owner and name of the session, providing appropriate defaults
        ** Collect a set of saved and immediately entered searches for the purpose of targeting records, and store a single derived search
        ** Collect a set of tag and subfield combinations describing the locations of interest that contain URLs within records found by the above search. Store these as XPath expressions.
        ** Offer a "Process immediately" option to skip Phase 4 (not skip to Phase 4, but skip Phase 4 itself) -- See below for details.
    * Phase 2 - Search for and collect records of interest
        ** Create a new bib record container of type "url-validate" and link this to the session record created in Phase 1
        ** Run the search, and store the full set of record IDs in the "url-validate" container
    * Phase 3 - Extract URLs from collected records
        ** Inspect each record that was just placed into the container, and use the tag/subfield XPath expressions to extract any URLs
        ** For each URL, extract and store the relevant data
            *** Container entry pointing to the record from which it came
            *** Tag and subfield from which it came
            *** The full content
            *** Scheme
            *** Host
            *** Domain
            *** TLD
            *** Path
            *** Page (last component of Path)
            *** Query
            *** Fragment
    * Phase 4 - Search/Filter/Sort URLs
        ** Skip this step if "Process immediately" was selected during session setup in Phase 1
        ** Otherwise, display an interface for selecting which URLs and records to process based on any component extracted and stored during Phase 3.
    * Phase 5 - Validate Selected/All URLs
        ** Accept a list of extracted URL IDs from the previous step or, if no filtering was done, all URLs
        ** For each unique URL in the set, make an HTTP HEAD request to test validity
            *** YAOUS (yet another org unit setting) for timeout value
            *** YAOUS for sleep period between each URL test
            *** For duplicated URLs, test only once and share the result across all instances
            *** Avoid testing URLs having the same domain sequentially
        ** Store HTTP response code
        ** If, and only if, a 3XX (redirect) code is returned
            *** Parse the new target URL, linking to the original
            *** Repeat as necessary, up to a sanity limit on redirect depth (YAOUS)
            *** Use a lookup table of hashes of URLs already redirected to (for the given original URL), to avoid loops
    * Phase 6 - Validation Status Report
        ** Display a summary breakdown of HTTP statuses and overall completion
        ** Offer an interface to Search/Filter/Sort URLs for inspection based on any component extracted and stored during Phase 3 as well as HTTP status of the originally extracted URL.  Included in this display should be the endpoint of any HTTP redirection the server requested.

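The Phase 5 rules above (test each unique URL once, avoid testing the same domain back to back, follow redirects up to a depth limit while guarding against loops) can be sketched in Python as follows.  This is a minimal sketch under stated assumptions, not the actual middle-layer code: all function names are hypothetical, the `delay`, `timeout`, and `max_redirects` parameters stand in for the three org unit settings, the domain is naively taken to be the last two labels of the host, and the round-robin ordering is only a best-effort way of separating URLs from the same domain.

```python
import http.client
import time
from collections import defaultdict, deque
from urllib.parse import urljoin, urlsplit


def head_request(url, timeout):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", target)  # http.client does not follow redirects
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def spread_by_domain(urls):
    """Drop duplicates, then interleave URLs round-robin across domains."""
    queues = defaultdict(deque)
    for url in dict.fromkeys(urls):  # keeps first-seen order, removes duplicates
        host = urlsplit(url).hostname or ""
        queues[".".join(host.split(".")[-2:])].append(url)
    ordered = []
    while queues:
        for domain in list(queues):
            ordered.append(queues[domain].popleft())
            if not queues[domain]:
                del queues[domain]
    return ordered


def follow_redirects(url, fetch, timeout=5.0, max_redirects=5):
    """Return the list of (url, status) pairs visited, starting at url."""
    chain = []
    seen = {url}  # a set stands in for the lookup table of URL hashes
    current = url
    while True:
        status, location = fetch(current, timeout)
        chain.append((current, status))
        if not (300 <= status < 400) or not location:
            break  # not a redirect: this is the endpoint
        if len(chain) > max_redirects:
            break  # sanity limit on redirect depth reached
        target = urljoin(current, location)  # resolves relative Location values
        if target in seen:
            break  # redirect loop detected
        seen.add(target)
        current = target
    return chain


def verify(urls, fetch=head_request, delay=1.0, timeout=5.0, max_redirects=5):
    """Test each unique URL once; duplicates share the entry in the result."""
    results = {}
    for url in spread_by_domain(urls):
        results[url] = follow_redirects(url, fetch, timeout, max_redirects)
        time.sleep(delay)  # sleep period between URL tests
    return results
```

Passing a stub in place of `head_request` (any callable taking a URL and a timeout and returning a status and an optional `Location` value) exercises the ordering, loop detection, and depth limit without network access.  The last entry of each returned chain is the redirect endpoint that the Phase 6 report would display.
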
Development Breakdown
---------------------

    * Database -- The database has been augmented with new tables to store and process bibliographic records as they are collected for URL verification
        ** Session state
        ** Session configuration
        ** Container type
        ** URL data and state storage
    * Middle Layer -- Several new API calls, culminating in a new OpenSRF service created to implement the required business logic.
        ** Session and Configuration management
        ** Session selection
        ** Record discovery
        ** Progress calculation
    * User Interface -- Several new interface components have been created to drive the use of the new OpenSRF APIs
        ** Session management
        ** Configuration management
        ** Extracted URL display (Search/Filter/Sort)
        ** Summary status display

User Interface
--------------
The user interface implements the workflow described above.  Displays of URLs for verification and then post-verification review make use of openils.widget.FlattenerGrid.

First Change to FlattenerFilterDialog
-------------------------------------

FlattenerFilterDialog has gained the ability to save a set of filter conditions via a Save button (optionally displayed) which calls a callback at onClick.

FlattenerFilterDialog now has a clean way to load a saved set of filter conditions (this part should be largely there already; see Trigger Event Log for a similar feature).

This instance of FlattenerFilterDialog stores saved sets of conditions in a DB table, and offers a dialog that allows a user to choose which saved set of conditions to load.

Second Change to FlattenerFilterDialog
--------------------------------------

We also now support IN and NOT IN operators.  The operand widget is the same as for a typical unary operator (any of them but 'between'), plus an adder (label probably '[+]') and a multiselect, the value set of which is augmented with every click of the adder.

Here's why this is needed.  Imagine needing to filter URLs in this way: say you want URLs only from the "http" scheme and matching neither the domain example.com nor example.net.  If you did this with FlattenerFilterDialog before this change, the result of setting up three conditions as described above would be (scheme = 'http' and (domain <> 'example.com' or domain <> 'example.net')).  Because every domain differs from at least one of those two values, this is effectively the same as having no filter on domain at all.

It worked that way because until now the dialog was designed only for equalities, not inequalities, and combining conditions on the same field with OR is correct for equalities (compare the situation where your three conditions are: scheme is http, domain is example.com, domain is example.net).

The multiselect scheme described above allows clauses in the WHERE constraint that look like (domain not in ('example.com','example.net')).

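The difference between the two WHERE clauses can be checked with a small experiment.  The sketch below uses Python's built-in sqlite3 module purely as a stand-in database; the `url` table, its two columns, and the sample rows are hypothetical and are not the Evergreen schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url (scheme TEXT, domain TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO url (scheme, domain) VALUES (?, ?)",
    [
        ("http", "example.com"),
        ("http", "example.net"),
        ("http", "example.org"),
        ("https", "example.org"),
    ],
)


def domains(where):
    """Return the domains of rows matching a fixed, trusted WHERE clause."""
    sql = f"SELECT domain FROM url WHERE {where} ORDER BY domain"
    return [row[0] for row in conn.execute(sql)]


# Inequalities joined with OR: every http row passes, so nothing is filtered.
or_of_inequalities = domains(
    "scheme = 'http' AND (domain <> 'example.com' OR domain <> 'example.net')"
)

# NOT IN: only http rows outside both excluded domains pass.
not_in = domains(
    "scheme = 'http' AND domain NOT IN ('example.com', 'example.net')"
)
```

With these rows, `or_of_inequalities` contains all three http domains, while `not_in` contains only `example.org`.
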