Extract text based instructions into a structured format

Unformatted text to structured data with GenAI

# artifical-intellegence

Project overview

Technologies

  • GPT-4 with Vision
  • Azure Document Intelligence

Tools

  • Text recognition
  • Python

Key features

  • Covert instructions from various formatting into consistent, structured data

Contributors


Why the client wanted this

  1. A reliable and accurate way of digitising a vast and diverse library of knitting instuctions from a variety of sources like images, magazines and documents - at scale
  2. Get the instructions into a structured format using a large language model so they could be archived and distributed.
  3. As far as we're aware, this AI-led approach to digitising knitting instructions had not been attempted before.

Methodology

Collecting documents

To begin this process, it is key to collect all the documents and images that need to be digitised. These can be in any readable format such as PDF and images, as long as the text is visible.

Extracting the text

Next came the question of how we were going to extract all the text verbatim into raw text. Enter Azure's Document Intelligence (DI) tool, backed by OpenAI's GPT-4.

DI was fairly reliable for getting the text into a good structured format so that is easily parsable for any AI to then convert into any other desired format.

Using GPT-4 to power the web app's UI

Now that we had the text in a digital format all that was needed was to get the instructions in a consistent, structured format so that the web app's UI had reliable data to be powered from.

For this we used GPT-4's chat completion feature and primed it to return only JSON (the desired output format for the UI) with the API's JSON mode, and gave a system prompt to ensure consistent output with our requirements for how we wanted the instructions presented.


Tech stack

Upload documents

The knitting instructions

Azure document intelligence

Text extraction

GPT-4 Turbo

Formatting to JSON


The prototype


Not just needles and threads

Mass analysis of legal and financial documents

Consider the legal industry, where contracts and case files often exist in a blend of typed and handwritten formats. Applying our techniques could streamline the analysis and organisation of these critical documents. The same could be said for the financial sector, awash with bank statements, invoices, and receipts, often a mix of digital and handwritten entries.

Educational resource compilation

Think of schools where educators compile varied teaching materials, including handwritten notes. Our solution could assist in creating a unified, digital repository of educational resources.