Today, data is the essential driver of progress and innovation. With the increasing need for automation in the legal industry, creating high-quality datasets has become a vital step for the training and validation of ML, AI, or NLP applications.

I’m Andrii Kovalov, a Python Engineer at DreamProIT. In today’s blog post, I would like to introduce several datasets that we built from the bills (legislative proposals) in the U.S. Congress. These are rich public data sources, which can be used for various ML/AI applications.

What is a dataset?

Let’s start with a basic definition. So, what is a dataset?

A dataset (also written “data set,” and sometimes called a data collection) is a structured collection of data points or observations organized and stored together for analysis or processing. Datasets can encompass different types of data, such as text, numbers, images, audio, and other forms of information that usually share a common theme or subject. They serve as the foundation for many data-driven activities, including statistical analysis, data mining, and, of course, training machine learning models. Although the term sounds pretty technical, we come across datasets every day, from statistics reported on the news to weather forecasts, stock performance, and many more.

Our datasets are hosted on HuggingFace, which is an open-source platform widely used for creating and deploying datasets and machine learning models. On HuggingFace, a “dataset” is a specialized format that allows better integration with AI tools.

There are many benefits to creating a dataset in a standardized form on Hugging Face:

  • It allows users to download and use the data with one line of code (see the example after this list);

  • The data can be managed efficiently in memory;

  • The fields in the dataset can be standardized to use common tools across many different machine learning tasks (labeling, summarization, question answering);

  • Each dataset comes with a metadata card that the creator of the dataset can fill in to provide context for the data.
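For example, pulling one of the datasets introduced below takes a single call to the datasets library. This is a minimal sketch; the exact splits and columns are described on each dataset’s card:

```python
from datasets import load_dataset

# One line to download (and cache) the dataset from the Hugging Face Hub.
# The Arrow-backed format is memory-mapped, so large datasets are handled
# efficiently in memory.
bills = load_dataset("dreamproit/bill_text_us")

print(bills)  # shows the available splits, row counts, and column names
```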

Source of the data

The first step in building a comprehensive dataset is gathering data from one or more sources. Our datasets are collections of data obtained from the U.S. Government Publishing Office (GPO), whose API provides, among other things, comprehensive information about U.S. federal bills and their associated metadata.

The bill data is usually available in XML and PDF formats, which require some processing to extract plain text or bill summaries. An earlier dataset, available at https://huggingface.co/datasets/billsum, contains bill summaries from Congress. Our aim was to update the data in this dataset to include the current Congress and also create an open-source repository that can be used to continue updating the bill data in the future (see below).
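To give a flavor of that processing, here is a minimal sketch that flattens a downloaded bill XML file into plain text with lxml. The file name is illustrative, and BillML’s actual pre-processing is more involved:

```python
from lxml import etree

# Parse a downloaded bill XML file (the path is illustrative).
tree = etree.parse("BILLS-118hr1234ih.xml")

# Concatenate every text node, then collapse whitespace left over from the markup.
plain_text = " ".join(tree.getroot().itertext())
plain_text = " ".join(plain_text.split())

print(plain_text[:500])
```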

What is the “bill_text_us” dataset?

https://huggingface.co/datasets/dreamproit/bill_text_us

Our first dataset is “bill_text_us”. It is an extensive and reliable resource containing the text of all bills from the U.S. Congress, from the 108th Congress onward, along with metadata. This dataset can serve as a great foundation for developing a model that can, for example, provide semantic search over bill content.
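As a rough sketch of what such a semantic search could look like, here is an example with the sentence-transformers library. The split name and the “text”/“title” columns are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

# Load a small slice for the demo (the "train" split name is assumed).
bills = load_dataset("dreamproit/bill_text_us", split="train[:1000]")

# Embed bill texts and a query into the same vector space
# (the "text" column name is assumed; long bills are truncated by the model).
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(bills["text"], convert_to_tensor=True)
query_embedding = model.encode("funding for renewable energy research", convert_to_tensor=True)

# Retrieve the five most semantically similar bills.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(round(hit["score"], 3), bills[hit["corpus_id"]].get("title", ""))
```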

What is the “bill_summary_us” dataset?

https://huggingface.co/datasets/dreamproit/bill_summary_us

Before introducing our next dataset, it is important to get acquainted with the human-written summaries provided by experts at the Congressional Research Service (CRS).

The CRS provides written summaries for most bills, prepared by highly trained legal professionals. These professionals have broad experience in legal writing and a deep understanding of the general context across multiple bills and related documents. As a result, the accuracy and contextual quality of human-written CRS bill summaries are significantly better than those of any summarization generated by ML or AI tools.

Our second dataset, “bill_summary_us”, contains all the bills along with their available CRS summaries. The bill texts in the dataset are very clean in terms of grammar, spelling, and formatting, while the expert summaries are written with great attention to detail. This combination makes for a truly unique set of data that can enhance any ML or AI project focused on, or in need of, summarization.

Where can such datasets be useful?

Let’s start with “bill_summary_us”, which can be used for summarization tasks. If we look at the most downloaded summarization datasets on Hugging Face, we will notice that most of their texts and summaries are informal. Our dataset, by contrast, comprises up-to-date, properly structured bill texts and professionally written, standardized summaries that cannot be obtained from the bill text alone. Such a dataset can benefit any summarization model.
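As a hedged sketch of that use, here is how the dataset could feed a standard seq2seq fine-tuning setup with the transformers library. The “text” and “summary” column names and the split name are assumptions to be checked against the dataset card:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the bill/summary pairs (the "train" split name is assumed).
data = load_dataset("dreamproit/bill_summary_us", split="train")

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def preprocess(batch):
    # Tokenize bill texts as inputs and the CRS summaries as target labels
    # (the "text" and "summary" column names are assumed).
    model_inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = data.map(preprocess, batched=True)
# "tokenized" can now be handed to a Seq2SeqTrainer for fine-tuning.
```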

Our “bill_text_us” dataset’s main value is the formal structure of bills, as well as the links between bills and existing U.S. legislation. This dataset can be an excellent resource for models aimed at analyzing, comparing, or searching bills.

How were those datasets made?

Here’s how we created these datasets, and how you can do the same using our open-source project BillML.

The project is structured around a Docker Compose setup with a single service and multiple named volumes to store the data. Running the pipeline means executing CLI commands inside the main container with docker exec, or (with a slight modification) using docker compose run to run a command and exit once it is finished. The project documentation is self-contained and explains every step required to download the data, process it, and create the datasets.

As a quick TL;DR, we can summarize the steps as follows:

  • Set up environment variables in a .env file (most of the project configuration is done via environment variables);

  • Start the services with docker compose (by default, the main container will keep running indefinitely);

  • Run the docker exec command to download the fdsys_billstatus.xml files (they contain metadata about each bill, as well as its summaries, titles, etc.);

  • Run the docker exec command to convert the previously downloaded fdsys_billstatus.xml files to data.json files, which the pipeline needs in order to extract summaries and titles;

  • Run the docker exec command to download the bill “packages” as .zip archives containing the bill XML and PDF files; together with the metadata downloaded in the steps above, this covers all the required bill data;

  • Run the docker exec command to create the “bill_text_us” or “bill_summary_us” datasets.

All bills and datasets are stored in Docker named volumes, so we don’t have to worry about data persistence between runs.
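To close the loop, a dataset built this way can be uploaded to the Hugging Face Hub with the datasets library. The snippet below is purely illustrative, with hypothetical field names and a placeholder repository id, not the code BillML actually runs:

```python
from datasets import Dataset

# Records assembled from the processed bill files
# (field names and values here are hypothetical).
records = [
    {"id": "118hr1234", "title": "An example bill title", "text": "Full bill text ..."},
]

dataset = Dataset.from_list(records)

# Requires authentication, e.g. via `huggingface-cli login`.
dataset.push_to_hub("your-org/your-bill-dataset")
```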

Key insights on these datasets

Overview of the datasets:

  • Total bill data size: 105+ GB (as of 17.10.2023)

  • Time to download from the original source: 72+ hours (as of 17.10.2023)

  • Time to create “bill_text_us” dataset: 2.5 hours (depending on your system)

  • Time to create “bill_summary_us” dataset: 2.5 hours (depending on your system)

How to use datasets uploaded to HuggingFace

Hugging Face has thorough documentation on creating, uploading, and using datasets, so just follow along and you will be able to work with the datasets.
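For instance, once a dataset is on the Hub, it can even be streamed, so you can inspect records without downloading everything up front (the split name here is assumed):

```python
from datasets import load_dataset

# Stream records instead of downloading the whole dataset up front
# (the "train" split name is assumed).
stream = load_dataset("dreamproit/bill_summary_us", split="train", streaming=True)

for i, record in enumerate(stream):
    print(record.keys())  # inspect the available fields
    if i == 2:
        break
```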

What is next for the datasets?

As for the future of these datasets, we are planning to set up automatic scheduled runs for the BillML project so that the datasets on HuggingFace will always be up to date and contain all the available bill data. This will be a major benefit for anyone interested in using these datasets for their projects, research, analysis, or any other purpose.

How can you contribute?

Finally, we are always open to contributions from individuals who would like to help improve the project. If you have any ideas for how the data or metadata can be enhanced, just visit the project’s repository and open an issue with your proposal.

We value all contributions and are grateful for any help that we receive.