In India, the judiciary is grappling with a backlog of over 50 million pending cases. Some believe that AI has the potential to reduce that backlog. Researchers from the University of Liverpool used language models to generate legal arguments from case facts. The top method achieved a 63 per cent overlap with benchmark annotations.

AI can summarise documents and suggest and predict applicable statutes, reducing the time spent on document processing and aiding legal professionals, says Procheta Sen, one of the authors of the paper “Automated argument generation from legal facts”.

“We used open-source models like GPT-2 and Facebook’s LLaMA for argument generation,” says Sen. LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs) released by Meta AI in February 2023.

Large Language Models and the Judiciary

LLMs have found success in various natural language processing (NLP) tasks such as machine translation, summarisation and entity recognition. Built on the transformer architecture, these models employ pre-trained, fine-tuned and prompt-based approaches to NLP tasks. Pre-trained models such as BERT and GPT-2 have outperformed baselines in numerous NLP tasks.

Sen et al.’s research paper used GPT-2 and Flan-T5 models to generate legal arguments from factual information. The models are fine-tuned using special tokens like ‘[Facts]’ and ‘[Arguments]’ to guide the generation process. Legal documents, known for their length, pose a challenge due to token limits, which could be overcome by using a BERT summariser to condense the content. The dataset had 50 legal documents from the Indian Supreme Court’s corpus, with each sentence labelled with one of seven rhetorical role categories: facts, ruling by lower court, argument, statute, precedent, ratio of decision and ruling by present court. The core idea lies in optimising argument generation by feeding the models different summaries produced with BERT.
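To make the role of the special tokens concrete, here is a minimal sketch of how a facts-and-argument pair might be laid out as text for fine-tuning. It is an illustration only, not the authors' code: the function names are hypothetical, and keeping the first few sentences is a crude stand-in for the BERT summariser described above.

```python
import re

# Marker strings quoted in the article; how the paper registers them
# with the tokenizer is not described here.
FACTS_TOKEN = "[Facts]"
ARGUMENTS_TOKEN = "[Arguments]"


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def condense(text, max_sentences):
    """Assumption: keep the first max_sentences sentences.
    This only stands in for a real summariser such as the BERT one."""
    return " ".join(split_sentences(text)[:max_sentences])


def build_training_text(facts, arguments, max_sentences=3):
    """Condensed facts followed by the target argument, each after its marker."""
    summary = condense(facts, max_sentences)
    return f"{FACTS_TOKEN} {summary} {ARGUMENTS_TOKEN} {arguments.strip()}"


def build_prompt(facts, max_sentences=3):
    """At generation time the text ends at the argument marker,
    leaving the model to continue from there."""
    return f"{FACTS_TOKEN} {condense(facts, max_sentences)} {ARGUMENTS_TOKEN}"
```

Raising `max_sentences` passes a longer summary to the model, at the cost of using up more of its token limit.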

The researchers used two evaluation metrics: average word overlap (a measure of the words shared between generated and actual arguments) and average semantic similarity (the similarity between BERT embeddings of generated and actual arguments). They found that “with the increase in the number of sentences in the summary, the quality of the generated argument also increased.”
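The two metrics can be sketched in a few lines. The exact formulas are not given here, so this is one plausible reading rather than the paper's definition: word overlap is taken as the share of the actual argument's unique words that also appear in the generated one, and semantic similarity as the cosine between two embedding vectors, which in practice would come from BERT but are plain lists of numbers in this sketch.

```python
import math
import re


def words(text):
    """Lower-cased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(generated, reference):
    """Assumed definition: fraction of the reference's unique words
    that also occur in the generated text."""
    ref = words(reference)
    if not ref:
        return 0.0
    return len(words(generated) & ref) / len(ref)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average(scores):
    """Mean of per-document scores, 0.0 for an empty collection."""
    scores = list(scores)
    return sum(scores) / len(scores) if scores else 0.0
```

Each score is computed per document and then averaged across the test set, which is where the "average" in both metric names comes from.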

Additionally, it was found that better data quality also enhances the model’s performance.

Not everything is rosy

But the challenge in understanding the material stems from the poorly structured English sentences in legal case proceedings, says Sen. This lack of refinement hampers the use of existing NLP tools and requires significant human effort for comprehension, she adds.

The limitations also include privacy concerns: when paid API services are used, sensitive data might be shared. Similarly, potential biases can exist in larger datasets, but proper fine-tuning and high-quality data can mitigate this issue, according to Sen.

While NLP has developed significantly, Sen feels that the need of the hour is “well-curated data.” Preserving case proceedings in a structured manner and creating annotated data also takes a lot of time, adds Sen.

While the research explored a broad problem for the judiciary, the dataset was very limited. The current work is an initial exploration and more advanced models are planned for the future, says Sen.