
Announcing enhanced table extractions with Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. The AnalyzeDocument API in Amazon Textract includes a Tables feature that automatically extracts tabular structures from any document. In this post, we discuss the improvements made to the Tables feature and how they make it easier to extract information in tabular structures from a wide variety of documents.

Tabular structures in documents such as financial reports, pay stubs, and certificate of analysis files are usually formatted so that the information is easy to interpret. They also often include a table title, table footer, section titles, and summary rows within the tabular structure for better readability and organization. Prior to this enhancement, the Tables feature within AnalyzeDocument would have identified these elements as ordinary cells, and it did not extract titles and footers that sit outside the table boundaries. In these cases, custom post-processing logic was required to identify this information or extract it separately from the API’s JSON output. With the enhancements to the Tables feature, extracting these aspects of tabular data becomes much simpler.

In April 2023, Amazon Textract introduced the ability to automatically detect titles, footers, section titles, and summary rows present in documents using the Tables feature. In this post, we discuss these improvements and provide examples to help you understand and use them in your document processing workflows. We walk through code examples that call the API and process the response with the Amazon Textract Textractor library.

Solution overview

The following image shows that the updated model identifies not only the table in the document but also all the corresponding headers and footers. This sample financial report document contains the table title, footer, section title, and summary rows.

Financial report with table

The Tables feature enhancement adds support for four new elements in the API response, which allow you to easily extract each of these table elements, and adds the ability to distinguish the table type.

Table elements

Amazon Textract can identify various components of a table, such as table cells and merged cells. These components, known as Block objects, encapsulate details related to the component, such as bounding geometry, relationships, and confidence score. A Block represents items that are recognized in a document within a group of pixels that are close to each other. The following are the new table blocks introduced in this enhancement:

  • Table title – A new Block type called TABLE_TITLE that lets you identify the title of a given table. Titles can be one or more lines, which are usually above a table or embedded as a cell within the table.
  • Table footer – A new Block type called TABLE_FOOTER that lets you identify the footers associated with a given table. Footers can be one or more lines, which are usually below the table or embedded as a cell within the table.
  • Section title – A new Block type called TABLE_SECTION_TITLE that lets you identify whether the detected cell is a section title.
  • Summary cells – A new Block type called TABLE_SUMMARY that lets you identify whether the cell is a summary cell, such as a cell for the totals on a pay stub.

Financial report with table elements
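To make the block relationships above more concrete, the following is a minimal sketch that walks a raw AnalyzeDocument response and collects the new elements for one table. It assumes, based on our reading of the API reference, that a TABLE block links to its TABLE_TITLE and TABLE_FOOTER blocks through relationships of the same names, and that section title and summary cells are CELL blocks tagged through EntityTypes. SAMPLE_BLOCKS is hand-written, heavily simplified illustration data, not real service output, and the helper names are hypothetical.

```python
# Hand-written, simplified stand-in for the "Blocks" list of an
# AnalyzeDocument response (real blocks also carry Geometry, Confidence, ...).
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"],
     "Relationships": [
         {"Type": "CHILD", "Ids": ["c1", "c2", "c3"]},
         {"Type": "TABLE_TITLE", "Ids": ["title1"]},
         {"Type": "TABLE_FOOTER", "Ids": ["foot1"]},
     ]},
    {"Id": "title1", "BlockType": "TABLE_TITLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "foot1", "BlockType": "TABLE_FOOTER",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c1", "BlockType": "CELL", "EntityTypes": ["COLUMN_HEADER"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "c2", "BlockType": "CELL", "EntityTypes": ["TABLE_SECTION_TITLE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w6", "w7"]}]},
    {"Id": "c3", "BlockType": "CELL", "EntityTypes": ["TABLE_SUMMARY"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w8"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Quarterly"},
    {"Id": "w2", "BlockType": "WORD", "Text": "results"},
    {"Id": "w3", "BlockType": "WORD", "Text": "In"},
    {"Id": "w4", "BlockType": "WORD", "Text": "millions"},
    {"Id": "w5", "BlockType": "WORD", "Text": "Amount"},
    {"Id": "w6", "BlockType": "WORD", "Text": "Operating"},
    {"Id": "w7", "BlockType": "WORD", "Text": "expenses"},
    {"Id": "w8", "BlockType": "WORD", "Text": "Total"},
]


def related_ids(block, rel_type):
    """Return the Ids a block points to through relationships of one type."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def text_of(block, by_id):
    """Join the text of the WORD children of a block."""
    children = [by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def describe_table(table, by_id):
    """Collect title, footer, section title, and summary text for a TABLE block."""
    cells = [by_id[i] for i in related_ids(table, "CHILD")]

    def cells_tagged(tag):
        return [text_of(c, by_id) for c in cells if tag in (c.get("EntityTypes") or [])]

    return {
        "titles": [text_of(by_id[i], by_id) for i in related_ids(table, "TABLE_TITLE")],
        "footers": [text_of(by_id[i], by_id) for i in related_ids(table, "TABLE_FOOTER")],
        "section_titles": cells_tagged("TABLE_SECTION_TITLE"),
        "summary_cells": cells_tagged("TABLE_SUMMARY"),
    }


by_id = {block["Id"]: block for block in SAMPLE_BLOCKS}
table_info = describe_table(by_id["t1"], by_id)
print(table_info)
```

In practice, the Textractor library covered later in this post hides this bookkeeping; the sketch is only meant to show where the new elements live in the JSON.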

Types of tables

When Amazon Textract identifies a table in a document, it extracts all the details of the table into a top-level Block type of TABLE. Tables can come in various shapes and sizes. For example, documents often contain tables that may or may not have a discernible table header. To help distinguish between these types of tables, we’ve added two new entity types for a TABLE Block: SEMI_STRUCTURED_TABLE and STRUCTURED_TABLE. These entity types help you distinguish between a structured table and a semi-structured table.

Structured tables are tables that have clearly defined column headers. With semi-structured tables, however, the data may not follow a strict structure. For example, data may appear in a tabular layout that isn’t a table with defined headers. The new entity types provide the flexibility to choose which tables to keep or drop during post-processing. The following image shows an example of STRUCTURED_TABLE and SEMI_STRUCTURED_TABLE.

Types of tables
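As a rough illustration of keeping or dropping tables by type during post-processing, the sketch below filters TABLE blocks by their EntityTypes. The sample list is hand-written and simplified, and tables_of_type is a hypothetical helper name, not part of any library.

```python
def tables_of_type(blocks, table_type):
    """Return the TABLE blocks whose EntityTypes include table_type."""
    return [
        block for block in blocks
        if block.get("BlockType") == "TABLE"
        and table_type in (block.get("EntityTypes") or [])
    ]


# Hand-written illustration data; real blocks contain many more fields.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"]},
    {"Id": "t2", "BlockType": "TABLE", "EntityTypes": ["SEMI_STRUCTURED_TABLE"]},
    {"Id": "l1", "BlockType": "LINE", "Text": "Not a table"},
]

structured = tables_of_type(sample_blocks, "STRUCTURED_TABLE")
semi_structured = tables_of_type(sample_blocks, "SEMI_STRUCTURED_TABLE")
print([t["Id"] for t in structured], [t["Id"] for t in semi_structured])
```

A pipeline that only trusts tables with clear column headers could, for example, pass the structured list on and route the semi-structured list to manual review.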

Parsing the API output

In this section, we explore how you can use the Amazon Textract Textractor library to post-process the output of the AnalyzeDocument API with the Tables feature enhancements. This allows you to extract relevant information from the tables.

Textractor is a library built to work seamlessly with the Amazon Textract APIs. It provides utilities that convert the JSON responses returned by the APIs into programmable objects. You can also use it to visualize entities in the document and export the data in formats such as comma-separated values (CSV) files. It is intended to help Amazon Textract customers set up their post-processing pipelines.

In our examples, we use the following sample page from an SEC 10-K filing.

SEC 10-K filing

The code below can be found in our GitHub repository. To process this document, we first install the Textractor library, which we use to post-process the API outputs and visualize the data:

pip install amazon-textract-textractor

The first step is to call the Amazon Textract AnalyzeDocument API with the Tables feature, denoted by the features=[TextractFeatures.TABLES] parameter, to extract the table information. Note that this method invokes the real-time (or synchronous) AnalyzeDocument API, which supports single-page documents. However, you can use the asynchronous StartDocumentAnalysis API to process multi-page documents (with up to 3,000 pages).

from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType
image = Image.open("sec_filing.png") # loads the document image with Pillow
extractor = Textractor(region_name="us-east-1") # Initialize textractor client, modify region if required
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
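As a side note for multi-page documents, the asynchronous route mentioned earlier returns a job ID, and the results are then fetched page by page. The following is a minimal sketch of that polling and pagination loop. It assumes that client behaves like a boto3 Textract client exposing get_document_analysis; the function name, polling interval, and retry limit are illustrative choices, not recommendations.

```python
import time


def collect_analysis_blocks(client, job_id, poll_seconds=5.0, max_polls=120):
    """Wait for an asynchronous analysis job, then gather Blocks from every result page."""
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "Textract job failed"))
        time.sleep(poll_seconds)  # still IN_PROGRESS, so wait and retry
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

    blocks = list(response.get("Blocks", []))
    # Large results are split across pages linked by NextToken.
    while response.get("NextToken"):
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


# Usage sketch (not executed here; bucket and file names are placeholders):
# import boto3
# client = boto3.client("textract", region_name="us-east-1")
# job = client.start_document_analysis(
#     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "filing.pdf"}},
#     FeatureTypes=["TABLES"],
# )
# blocks = collect_analysis_blocks(client, job["JobId"])
```

Production pipelines typically rely on completion notifications rather than polling; the loop here is only the simplest way to show the flow. The rest of this post continues with the single-page document object returned by the synchronous call.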

The document object contains metadata about the document that can be reviewed. Note that it recognizes one table in the document along with other entities:

This document holds the following data:
Pages - 1
Words - 658
Lines - 122
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0

Now that we have the API output containing the table information, let’s view the different elements of the table using the response structure discussed above:

table = EntityList(document.tables[0])
document.tables[0].visualize()

SEC 10-K filing table highlighted

The Textractor library highlights the different entities in the detected table with a different color code for each element in the table. Let’s take a closer look at how we can extract each element. The following code snippet shows extracting the table title:

table_title = table[0].title.text
table_title

'The following table summarizes, by major security type, our cash, cash equivalents, restricted cash, and marketable securities that are measured at fair value on a recurring basis and are categorized using the fair value hierarchy (in millions):'

Similarly, we can use the following code to extract table footers. Note that table_footers is a list, which means that there can be one or more footers associated with the table. We can loop through this list to see all the footers present, and as shown in the following code snippet, the output shows three footers:

table_footers = table[0].footers
for footer in table_footers:
    print(footer.text)

(1) The related unrealized gain (loss) recorded in "Other income (expense), net" was $(116) million and $1.0 billion in Q3 2021 and Q3 2022, and $6 million and $(11.3) billion for the nine months ended September 30, 2021 and 2022.

(2) We are required to pledge or otherwise restrict a portion of our cash, cash equivalents, and marketable fixed income securities primarily as collateral for real estate, amounts due to third-party sellers in certain jurisdictions, debt, and standby and trade letters of credit. We classify cash, cash equivalents, and marketable fixed income securities with use restrictions of less than twelve months as "Accounts receivable, net and other" and of twelve months or longer as non-current "Other assets" on our consolidated balance sheets. See "Note 4 - Commitments and Contingencies."

(3) Our equity investment in Rivian had a fair value of $15.6 billion and $5.2 billion as of December 31, 2021 and September 30, 2022, respectively. The investment was subject to regulatory sales restrictions resulting in a discount for lack of marketability of approximately $800 million as of December 31, 2021, which expired in Q1 2022.
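Once the footers are available as text, a small amount of custom logic can connect them to the rows that cite them. The sketch below is one possible approach, not a Textractor feature: it assumes that each footer starts with a marker such as (1) and that row labels carry the same markers, as in "Equity securities (1)(3)" in this filing. The footer strings are shortened with "..." for brevity, and the helper names are hypothetical.

```python
import re

LEADING_MARKER = re.compile(r"\s*\((\d+)\)\s*")
MARKER = re.compile(r"\((\d+)\)")


def index_footers(footer_texts):
    """Map a leading marker such as '(2)' to the footer text that follows it."""
    indexed = {}
    for text in footer_texts:
        match = LEADING_MARKER.match(text)
        if match:
            indexed[int(match.group(1))] = text[match.end():]
    return indexed


def footnotes_for(label, indexed):
    """Return the footer texts referenced by the markers in a row label."""
    refs = [int(number) for number in MARKER.findall(label)]
    return [indexed[n] for n in refs if n in indexed]


# Shortened versions of the three footers printed above.
footer_texts = [
    '(1) The related unrealized gain (loss) recorded in "Other income (expense), net" ...',
    "(2) We are required to pledge or otherwise restrict a portion of our cash ...",
    "(3) Our equity investment in Rivian had a fair value of ...",
]

footer_index = index_footers(footer_texts)
notes = footnotes_for("Equity securities (1)(3)", footer_index)
for note in notes:
    print(note)
```

With Textractor, footer_texts could be built from the footers list shown earlier, for example with a list comprehension over each footer’s text attribute.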

Generating data for subsequent ingestion

The Textractor library also helps you simplify ingesting table data into downstream systems or other workflows. For example, you can export the extracted table data to a human-readable Microsoft Excel file. At the time of this writing, this is the only format that supports merged cells.

table[0].to_excel(filepath="sec_filing.xlsx")

Table in Excel

We can also convert it to a Pandas DataFrame. DataFrame is a popular choice for data manipulation, analysis, and visualization in programming languages such as Python and R.

In Python, DataFrame is a primary data structure in the Pandas library. It is flexible and powerful, and is often the first choice for data analytics professionals for various ML and data analysis tasks. The following code snippet shows how to convert the extracted table information into a DataFrame with a single line of code:

df = table[0].to_pandas()
df

Table in DataFrame

Finally, we can convert the table data into a CSV file. CSV files are often used to ingest data into relational databases or data warehouses. See the following code:

table[0].to_csv()

',0,1,2,3,4,5\n0,,"December 31, 2021",,September,"30, 2022",\n1,,Total Estimated Fair Value,Cost or Amortized Cost,Gross Unrealized Gains,Gross Unrealized Losses,Total Estimated Fair Value\n2,Cash,"$ 10,942","$ 10,720",$ -,$ -,"$ 10,720"\n3,Level 1 securities:,,,,,\n4,Money market funds,"20,312","16,697",-,-,"16,697"\n5,Equity securities (1)(3),"1,646",,,,"5,988"\n6,Level 2 securities:,,,,,\n7,Foreign government and agency securities,181,141,-,(2),139\n8,U.S. government and agency securities,"4,300","2,301",-,(169),"2,132"\n9,Corporate debt securities,"35,764","20,229",-,(799),"19,430"\n10,Asset-backed securities,"6,738","3,578",-,(191),"3,387"\n11,Other fixed income securities,686,403,-,(22),381\n12,Equity securities (1)(3),"15,740",,,,19\n13,,"$ 96,309","$ 54,069",$ -,"$ (1,183)","$ 58,893"\n14,"Less: Restricted cash, cash equivalents, and marketable securities (2)",(260),,,,(231)\n15,"Total cash, cash equivalents, and marketable securities","$ 96,049",,,,"$ 58,662"\n'
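Before loading such a CSV into a database, the formatted amounts usually need to become numbers. The following is a minimal sketch using only the Python standard library. It assumes accounting conventions that seem to apply to this table but are not guaranteed in general: parentheses mark negative values and a lone dash means zero. The two data rows are copied from the output above, and the helper names are hypothetical.

```python
import csv
import io


def parse_amount(cell):
    """Convert strings like '$ 10,942' or '(169)' to a float; blanks become None."""
    text = cell.replace("$", "").replace(",", "").strip()
    if not text:
        return None
    if text == "-":
        return 0.0  # assumption: a lone dash stands for zero
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]  # assumption: parentheses mark a negative amount
    try:
        value = float(text)
    except ValueError:
        return None  # not numeric, for example a header cell
    return -value if negative else value


def load_rows(csv_text):
    """Return (label, amounts) pairs, skipping the row of column positions."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader)  # first row only holds the column positions 0..5
    return [
        (row[1], [parse_amount(cell) for cell in row[2:]])
        for row in reader
        if len(row) > 2
    ]


csv_text = (
    ',0,1,2,3,4,5\n'
    '2,Cash,"$ 10,942","$ 10,720",$ -,$ -,"$ 10,720"\n'
    '8,U.S. government and agency securities,"4,300","2,301",-,(169),"2,132"\n'
)

rows = load_rows(csv_text)
for label, amounts in rows:
    print(label, amounts)
```

The first CSV column is the row index written by the export, which is why the label is read from the second column.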

Conclusion

The introduction of these new block and entity types (TABLE_TITLE, TABLE_FOOTER, TABLE_SECTION_TITLE, TABLE_SUMMARY, STRUCTURED_TABLE, and SEMI_STRUCTURED_TABLE) represents a significant advance in extracting tabular structures from documents with Amazon Textract.

These tools provide a more nuanced and flexible approach, catering to both structured and semi-structured tables and ensuring that no important data is overlooked, regardless of its location in a document.

This means we can now handle multiple data types and table structures with improved efficiency and accuracy. As we continue to embrace the power of automation in document processing workflows, these improvements will pave the way for more streamlined workflows, greater productivity, and more detailed data analysis. For more information on AnalyzeDocument and the Tables feature, see AnalyzeDocument.


About the authors

Raj Pathak is a Senior Solutions Architect and Technologist specialized in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM), and Machine Learning Infrastructure and Operations (MLOps) projects.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the global AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations and is actively helping customers get started with and scale AWS AI services.

Lalita Reddy is a Senior Technical Product Manager with the Amazon Textract team. She focuses on building machine learning-based services for AWS customers. In her spare time, Lalita enjoys playing board games and hiking.

