Generative AI has created a worldwide buzz and continues to impress both everyday users and businesses. However, to get the best results from LLMs on industry-specific tasks, these generative tools have to be trained on relevant domain-specific data.
In our previous blog post, we shared insights from our research on how to implement LLMs in the legal industry. You can check it out here. Today, we will continue our investigative journey to explore the process of enhancing LLMs with your own domain-specific private data.
Understanding the specifics of legal data
As part of our daily work, we engage with diverse types of legal data, which include semi-structured data in PDF format and structured data in XML format. To make sense of this data, we use classic software-based approaches to parse and interpret a natural-language linking and citation system.
The legal dataset we work with has its unique nuances, presenting several challenges:
- Lengthy tables in regular PDFs – imagine having to work with tables that span dozens of PDF pages.
- Pages filled with financial data – dozens of pages containing mostly numbers, with only a few words of surrounding text to sift through.
- General dryness and repetitiveness – legal language can be dry and repetitive, with recurring patterns and formal structures that require a nuanced understanding during processing.
In the face of all these challenges, our goal is to extract meaningful information and insights, while adapting to the specifics of the legal domain and the diverse formats in which it is presented.
How to enhance LLMs with your own private data?
In our research, we have identified two primary ways of integrating LLMs with your own proprietary data: fine-tuning and RAG. Let’s take a closer look at both methods.
Fine-tuning LLMs with your own data
Fine-tuning is a supervised machine-learning technique used to improve the performance of a pre-trained large language model (LLM) on specific tasks. The process involves training the LLM on a new dataset of labeled examples, updating the model’s weights to improve its accuracy and its ability to generate high-quality outputs.
To put it simply, the LLM is provided with a dataset of prompts and expected completions. After training on, say, 1,000 examples of such prompt-completion pairs, the LLM is expected to “remember” or “learn” how to generate accurate and relevant outputs for these prompts. Once fine-tuned, it can apply this newly acquired knowledge to the 1,001st prompt, and to other unseen prompts, and generate high-quality completions for them.
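To make this more concrete, here is a minimal sketch of what such a prompt/completion dataset could look like as a JSONL file. The file name, field names, and the legal example are purely illustrative assumptions; the exact schema depends on the fine-tuning framework you use.

```python
# A minimal sketch of a prompt/completion dataset in JSONL form, assuming a
# hypothetical legal question-answering task. The schema and contents are
# illustrative only.
import json

examples = [
    {
        "prompt": "Summarize the termination clause of the following agreement:\n<clause text>",
        "completion": "The agreement may be terminated by either party with written notice of ...",
    },
    # ... roughly a thousand more labeled pairs like the one above
]

with open("legal_finetune_train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```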
Fine-tuning has proven to be highly effective in improving the completion quality of LLMs in various areas, including customer support, code generation, and question-answering. For instance, in the field of customer support, where the range of possible questions can be somewhat limited, fine-tuning has been particularly effective in improving the accuracy of LLMs in generating relevant and helpful responses. Similarly, in code generation, fine-tuning has been used to train models specifically for generating SQL queries, resulting in significant improvements in accuracy and efficiency. It is also effective for question-answering cases that require a chain of thought and reasoning.
Complexity and requirements for fine-tuning LLMs
The process of fine-tuning LLMs has become significantly less complex, thanks to the active participation of the open-source community in developing code and tools that streamline the process and make it more accessible to a wider range of users.
One of the many remarkable advancements in recent developments for running LLMs is quantization. This breakthrough has made LLMs accessible on regular consumer-grade hardware, such as a single desktop GPU. Another significant breakthrough is LoRA (Low-Rank Adaptation), a very effective technique for reducing the number of trainable parameters in LLMs. The continuous evolution of technologies in the field has given rise to QLoRA, an even more efficient approach that fine-tunes quantized LLMs with LoRA adapters.
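The following is a rough sketch, not a recipe, of what a QLoRA-style setup can look like with the Hugging Face transformers, peft, and bitsandbytes libraries. The base model name and every hyperparameter below are placeholder assumptions.

```python
# A rough sketch of a QLoRA-style setup: load a causal LM in 4-bit and attach
# low-rank adapters so only a small fraction of parameters is trainable.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which layers receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically well under 1% of all weights
```

From here, the adapted model is trained on the prompt/completion dataset with a standard training loop or trainer of your choice.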
Although GPUs are still essential and you still have to deal with code, tooling, and infrastructure, the quality and completeness of the training dataset is now the most critical and difficult part of the process, directly influencing the model’s output quality. In other words, the better the dataset, the better the model’s output will be.
Therefore, having an ideal dataset with an adequate number of items is immensely important. The time and effort invested in creating the right dataset will pay off in the model’s ability to learn and generate quality output.
What to expect from fine-tuned LLMs?
As I mentioned earlier, the results obtained from fine-tuned LLMs depend on the quality of the training dataset used. With a high-quality dataset, it is possible to achieve promising outcomes. You can find many examples of such results in articles like this one, which also provides in-depth insights into the process.
In essence, if you have user prompts and a clear understanding of the expected completion, it can be a good start for creating a training dataset and subsequently using it to fine-tune your own model.
Leveraging RAG to enhance LLMs with private data
Retrieval-augmented generation (RAG) is a technique that improves the quality of responses generated by LLMs by combining neural information retrieval with neural text generation.
While the definition of RAG might be somewhat difficult to understand, the idea behind it is simple. LLMs are stateless: they treat every user prompt as a brand new one, generating completions based solely on their training data. But what if we could add additional, relevant context to the user’s question or request in the prompt? By doing so, we could significantly improve the quality of the LLM’s completions.
So, how do we add the right context for the user’s question or request? This is where RAG comes into play. The source data is split into chunks and converted into embeddings stored in a vector index. At query time, the user’s request is embedded as well, and a vector search retrieves the most relevant chunks; these retrieved chunks then augment the prompt, being presented to the LLM as context alongside the user’s original question. This approach significantly enhances the quality of LLM completions by incorporating relevant contextual information.
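As a minimal sketch of this retrieve-then-prompt flow, assuming the sentence-transformers library and a brute-force in-memory search (a production setup would typically use a vector database), it could look something like this. The model name, chunk texts, and prompt template are illustrative only.

```python
# A minimal retrieve-then-prompt sketch: embed chunks, find the most similar
# ones for a question, and assemble them into a context for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Section 12: Either party may terminate this agreement with written notice ...",
    "Schedule B lists the licensing fees payable per quarter ...",
    # ... the rest of your pre-processed document chunks
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q_vec          # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

question = "Under what conditions can the agreement be terminated?"
context = "\n\n".join(retrieve(question))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# `prompt` is then sent to the LLM of your choice.
```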
Creating effective RAG strategies
To implement RAG successfully and achieve high-quality completions, it’s important to have the right strategy in place. The “basic” RAG strategies outlined in many tutorials rely heavily on data quality: if you are dealing with raw, unprocessed, and uncleaned data, the retrieval results can be poor.
Here are key constraints and considerations you need to bear in mind:
- High-quality embedded data. The quality of the embedded data has to be very high. Take into account factors such as the size of the raw data chunks, how these chunks overlap, and how meaningful the data within each chunk is (see the chunking sketch after this list).
- Optimal retrieval results. Consider how many of the top retrieval results you provide in the LLM’s prompt. Providing too much information can potentially diminish the quality of the results.
- Use advanced strategies. For optimal results, consider more advanced strategies. Experimenting with different chunk sizes, various ranking and re-ranking approaches, and exploring combinations such as question+retrieved data or question+retrieved data+answer can significantly enhance the quality of LLM completions.
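Here is a naive sliding-window chunker that illustrates the chunk-size and overlap trade-off. Splitting on raw characters is a simplification; in practice you would more likely split on tokens, sentences, or document structure such as sections and clauses.

```python
# A naive sliding-window chunker to illustrate the chunk-size / overlap trade-off.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

legal_document_text = "..."  # placeholder: your pre-processed document text

# Example: experiment with several sizes and compare retrieval quality downstream.
for size in (500, 1000, 2000):
    print(size, len(chunk_text(legal_document_text, chunk_size=size)))
```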
Testing and evaluating the RAG setup
No matter what kind of RAG strategy you are using, it is very important to have a robust system for checking and evaluating the various aspects of your RAG setup.
Here are the key factors to consider:
- Relevance of context. How relevant is the context (the retrieved embedded data) to the user’s question? This evaluation ensures that the provided context actually enhances the overall understanding of the inquiry.
- Relevance of answer. How relevant is the answer to the user’s question, given that context? This step ensures that the generated responses align with the contextual information retrieved.
Using LLMs for these evaluations is a cost-effective alternative to manual human evaluation, but it is essential to understand that this method relies on the accuracy of the model’s assessment. Although not infallible, it is still better than no evaluation at all.
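A bare-bones sketch of such an LLM-based check, using the OpenAI Python client as one possible judge, might look like the following. The judging prompt, scoring scale, and model name are assumptions, and the scores should be treated as a noisy signal rather than ground truth.

```python
# A bare-bones sketch of LLM-based evaluation of context and answer relevance.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, context: str, answer: str) -> str:
    """Ask a judge model to score context and answer relevance on a 1-5 scale."""
    judging_prompt = (
        "On a scale of 1 to 5, rate:\n"
        "1) how relevant the context is to the question, and\n"
        "2) how well the answer is supported by that context.\n\n"
        f"Question: {question}\n\nContext: {context}\n\nAnswer: {answer}\n\n"
        "Reply as: context_relevance=<score>, answer_relevance=<score>, "
        "plus one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                # placeholder judge model
        messages=[{"role": "user", "content": judging_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```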
If you’re looking for additional resources to help with your RAG setup, there are many projects worth exploring. I recommend keeping an eye on and experimenting with projects like Ragas and TruLens to enhance your RAG configuration.
Closing thoughts
The need to provide context data for LLMs has introduced a lot of challenges, giving rise to a vast landscape of projects and technologies that try to address them. On top of that, the private data intended for use as context for LLMs is often uncleaned and unstructured, requiring additional transformations before it can be served to the model. All of these factors bring more challenges and interesting problems to solve for developers around the world.