by John Burchfield
We’ve made no secret of our love for technology-assisted review (TAR), particularly when it moves beyond TAR 1.0—predictive coding using simple active learning (SAL) and simple passive learning (SPL)—and incorporates continuous active learning (CAL), giving us TAR 2.0. But CAL isn’t the only differentiator: “contextual diversity” also plays a significant role in the success of TAR 2.0.
Because any TAR workflow is most efficient when it starts from judgmental rather than random seeds, it is best to begin the process with input from counsel. That input can take the form of “hot” documents already identified, or even a synthetic model document containing the kind of information you know you are looking for. The drawback is that judgmental seeding can introduce bias into the process, which is the reason most often cited for using random seeds instead. This is precisely the challenge that a contextual diversity engine like the one found in Catalyst Predict addresses. Contextual diversity refers to documents that are significantly different from those reviewers have already seen, whether those were the initial seeds or documents found in later rounds of training the engine. Surfacing such documents serves as an additional defensibility safeguard, designed to ensure that little responsive material is left unseen at the end of the review process.
With CAL, as opposed to SAL or SPL, the reviewers are continually re-training the document-ranking algorithm, because their coding decisions on each new set of documents are fed back into it. This way, relevant documents are found faster. The continuous learning approach also offers other tools that improve performance, combat potential bias and ensure complete topical coverage throughout the review lifecycle. Contextual diversity is one of these tools, and it addresses all three concerns.
As CAL continuously cycles every reviewed document back into the system and uses those judgments to select the next documents, the majority of the documents prioritized are taken from the top of the “relevance” queue. To protect against bias, TAR 2.0 integrates contextual diversity samples as part of the active training in CAL. Accordingly, a small but not insignificant number of documents are also taken from the contextual diversity queue for review.
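The mix of the two queues can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the batch size and the 20 percent diversity share are all hypothetical, and nothing here reflects how Catalyst Predict actually composes a batch.

```python
def build_review_batch(relevance_queue, diversity_queue,
                       batch_size=10, diversity_share=0.2):
    """Combine top-ranked documents with a few contextual diversity picks."""
    # Hypothetical policy: reserve a fixed fraction of each batch for diversity.
    n_diverse = min(len(diversity_queue), round(batch_size * diversity_share))
    diverse = diversity_queue[:n_diverse]
    # Fill the rest from the top of the relevance ranking, skipping duplicates.
    taken = set(diverse)
    top_ranked = [doc for doc in relevance_queue if doc not in taken]
    return top_ranked[: batch_size - n_diverse] + diverse
```

With the assumed defaults, a batch of ten would hold eight documents from the top of the relevance queue and two from the contextual diversity queue. If the diversity queue is empty, the whole batch comes from the relevance ranking.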
The contextual diversity algorithm selects documents that are as different as possible from the documents that have been reviewed, while making sure that those documents are as similar as possible to the largest remaining topical groupings of unreviewed documents. In other words, it presents reviewers with the most representative documents from the areas they know least about: documents that are quite different, topically speaking, from what they have already seen, regardless of how those documents were initially ranked.
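One way to picture that selection rule is the toy sketch below. It is not the Catalyst Predict algorithm. It assumes each document is reduced to a set of terms, uses Jaccard overlap as a stand-in for topical similarity, and relies on a hypothetical threshold to decide when two documents count as "similar". All function and parameter names are invented for the illustration.

```python
def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_diverse_document(unreviewed, reviewed, threshold=0.3):
    """Return the id of the unreviewed document that best represents
    the largest group of documents unlike anything reviewed so far."""
    # Keep only documents that resemble nothing a reviewer has already seen.
    unseen = {
        doc_id: terms
        for doc_id, terms in unreviewed.items()
        if all(jaccard(terms, seen) < threshold for seen in reviewed.values())
    }
    if not unseen:
        return None

    # A candidate "represents" every unseen document it is similar to.
    def group_size(doc_id):
        return sum(jaccard(unseen[doc_id], other) >= threshold
                   for other in unseen.values())

    # Sorting first makes ties resolve to the lowest document id.
    return max(sorted(unseen), key=group_size)
```

Note that the relevance ranking plays no part in the choice. A document is picked because it stands in for many unseen neighbors, not because the model currently scores it as likely responsive.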
Of course, the TAR 2.0 software platform cannot identify how important a particular topic might be. But it can recognize that no one has viewed, for instance, an entire subset of documents on a specific topic and select representative samples from that specific document population to use as training sets.
As this self-diversification process repeats, it moves from identifying broad themes of unknown information into smaller and smaller pockets of information, until no significant topical stone is left unturned. As a result, the few responsive documents remaining to be found are unlikely to share any meaningful traits that reviewers have not already seen.
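The broad-to-narrow progression can be illustrated with a deliberately simplified sketch. It assumes every document already carries a single topic label, which a real engine would not have, and it treats one representative sample as enough to cover a pocket. The function name and the `min_pocket` cutoff are hypothetical.

```python
from collections import Counter


def sampling_order(topic_by_doc, seen_topics=(), min_pocket=2):
    """List (topic, size) pairs in the order a diversity sampler might visit them."""
    seen = set(seen_topics)
    order = []
    while True:
        # Count unreviewed documents in topics no reviewer has touched yet.
        pockets = Counter(t for t in topic_by_doc.values() if t not in seen)
        if not pockets:
            break
        # Largest untouched pocket first; ties are broken alphabetically.
        topic, size = min(pockets.items(), key=lambda kv: (-kv[1], kv[0]))
        if size < min_pocket:
            break  # what is left is too small to count as a shared theme
        order.append((topic, size))
        seen.add(topic)  # a representative sample now covers this pocket
    return order
```

Each pass visits a pocket no larger than the one before it, and the loop stops once only stray documents remain that share no topic with one another.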
Thus, contextual diversity solves the potential problem of judgmental seed bias. Additionally, it adds another level of quality control, confidence and defensibility, as it makes it highly unlikely that there are any shared traits among the remaining documents that are not already reflected in the documents identified for production. This is a result of the contextual diversity sampling, where the machine has prioritized pockets of words and concepts that have not been seen yet. It also allows CAL to function to its fullest capabilities by constantly selecting documents that will best train the algorithm while making sure that reviewers see a representative sample of the document population.
TAR 2.0 is truly the next frontier of eDiscovery review technology. CAL has altered the landscape in such a way that while TAR 1.0 may still be perfectly effective in some cases, the far more robust TAR 2.0 will eventually surpass it in efficacy. Adding contextual diversity to the CAL process is just one more way that TAR 2.0 is that much better. Spicy, if you will.