Research Article
Corresponding author: Florian Ketelaar (florian@ketelaar.tv). Academic editor: Oscar van Leeuwen.
© 2025 Florian Ketelaar, Ana Mićković.
This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0), which permits copying and distributing the article for non-commercial purposes, provided that the article is not altered or modified and the original author and source are credited.
Citation:
Ketelaar F, Mićković A (2025) Artificial Intelligence in fraud detection: textual analysis of 10-K filings. Maandblad voor Accountancy en Bedrijfseconomie 99(2): 61-71. https://doi.org/10.5117/mab.99.132881
In this paper, we investigate the potential of Artificial Intelligence (AI) in detecting fraud by analyzing linguistic indicators in 10-K filings. We analyze word frequencies (positive, negative, uncertainty, and litigious), consistency, and readability in the MD&A sections. A BERT-based AI model was trained on these factors to predict fraud and showed significant promise compared to traditional models. The findings suggest that fraudulent filings tend to contain more positive words, inconsistent language, and higher readability. This highlights AI’s practical role in improving fraud detection in financial reports.
Fraud detection, Artificial Intelligence, financial reports, textual analysis, BERT model
This research demonstrates the practical application of AI, specifically BERT, in enhancing fraud detection in financial reports. By identifying key linguistic indicators of fraud, it provides a tool for auditors and regulators to improve accuracy and efficiency in monitoring and investigating potential financial misconduct.
Technological advancements have continually reshaped industries. One of the most significant transformations currently underway is the rise of Artificial Intelligence (AI), which is becoming increasingly important across various industries and research fields. AI is already impacting many data-related jobs, so it is no surprise that the audit industry is expected to be affected as well. Studies such as
Consulting companies have started adopting AI tools to improve their audit processes, particularly in areas like fraud detection, which is a major risk for firms due to its potential reputational and legal consequences. As fraud evolves, auditors must innovate. AI tools like HeadStart or Argus are designed to help auditors navigate through complex regulations and analyze entire datasets, rather than just samples, to identify risks, anomalies, and trends (
We aim to answer the following research question: What factors does AI detect as potentially fraudulent from specific linguistic patterns within 10-K filings?
Being able to correctly and efficiently detect financial fraud, particularly in corporate financial statements such as 10-K SEC (Securities and Exchange Commission) filings, is very important and cannot be ignored. Traditional methods of financial fraud detection, which rely mostly on human analysis and standard statistical techniques, have proven insufficient at identifying fraud. In contrast to traditional fraud detection methods, AI has advanced computational and learning capabilities, making it a valuable tool for auditors. The study of
Current literature has focused on the potential of AI in multiple domains of financial analysis, but its use in fraud detection within 10-K filings remains under-explored (
Financial fraud, particularly accounting fraud involving the manipulation of financial statements, is a major concern (
The readability of financial reports, particularly in the MD&A sections, can provide insights into fraudulent behavior. The Fog Index, developed by
In addition to readability, the sentiment expressed through positive and negative words in a text can signal underlying financial conditions. Positive words convey an optimistic tone, while negative words reflect a pessimistic one. However, as
Artificial Intelligence (AI) encompasses advancements like Machine Learning (ML) and Natural Language Processing (NLP), along with traditional statistical methods such as classification and clustering (
In this study, NLP is employed to automatically classify companies as fraudulent or non-fraudulent. The model used is based on BERT (Bidirectional Encoder Representations from Transformers), introduced by
AI has also been widely implemented in accounting, transforming traditional tasks. Studies such as
In the study by
Hypothesis 1: A BERT model remains effective in detecting fraudulent activity in more recent years, with accuracy comparable to or exceeding that reported by
To detect fraud through word categorization, a model is needed that analyzes the language and draws well-substantiated conclusions. Agency theory highlights linguistic characteristics that may signal fraudulent behavior.
Building on the work of
Hypothesis 2: Firms engaged in fraudulent activities use a different frequency of positive, negative, litigious, and uncertainty words in their MD&A sections compared to non-fraudulent firms.
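To make this measure concrete, word-category frequencies can be computed as each category's share of total words in the MD&A text. The sketch below illustrates this in Python; the tiny word lists are placeholders for illustration only, not the published dictionaries used in our analysis, and the regex tokenizer is a simplification.

```python
# Illustrative sketch of counting word-category frequencies in an MD&A section.
# The small word lists below are placeholders, not the study's dictionaries.
import re

WORD_LISTS = {
    "positive":    {"achieve", "benefit", "gain", "improve", "strong"},
    "negative":    {"decline", "loss", "impairment", "adverse", "weak"},
    "uncertainty": {"may", "approximately", "uncertain", "possibly", "risk"},
    "litigious":   {"litigation", "plaintiff", "settlement", "lawsuit", "claim"},
}

def category_frequencies(text: str) -> dict:
    """Return each category's share of total words, in percent."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1
    return {cat: 100 * sum(t in words for t in tokens) / total
            for cat, words in WORD_LISTS.items()}

print(category_frequencies("The decline in revenue may lead to litigation."))
```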
To further explore the linguistic factors AI detects, we analyze the frequency of specific word categories over time.
Hypothesis 3: The frequency of word categories in non-fraudulent filings is more consistent than in fraudulent filings.
In addition to analyzing the frequency and consistency of word categories, it is crucial to consider the readability of the MD&A sections as a potential indicator of fraudulent activity. A common measure of readability is the Fog Index mentioned in section 2.2 (
Hypothesis 4: Firms engaged in fraudulent activities have a higher Fog Index in their MD&A sections compared to non-fraudulent firms.
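For reference, the Gunning Fog Index combines average sentence length with the percentage of complex words (words of three or more syllables): Fog = 0.4 × (words per sentence + 100 × complex words / total words). The minimal Python sketch below illustrates the computation; the syllable counter is a rough vowel-group heuristic used for illustration only, not the exact implementation applied in our analysis.

```python
# Minimal sketch of the Gunning Fog Index:
#   Fog = 0.4 * (words per sentence + 100 * complex_words / words)
# "Complex" words have three or more syllables; the syllable count below
# is a rough heuristic for illustration.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

sample = "Management anticipates considerable operational improvements. Revenue grew."
print(round(fog_index(sample), 1))
```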
We employ AI models based on machine learning techniques like Natural Language Processing (NLP) and anomaly detection to identify patterns indicative of fraud. Specifically, we use BERT (Bidirectional Encoder Representations from Transformers), designed by Google (
In our analysis, we use the EDGAR database of the US Securities and Exchange Commission, which contains 10-K annual reports of US firms from 1994 to 2024. These publicly available 10-K reports were downloaded via the University of Notre Dame (
Our dataset is constructed from the EDGAR database, using 10-K SEC filings from 1994 to 2021, pre-parsed by the University of Notre Dame (
For our supervised learning model, we label examples as fraudulent or not by matching the Central Index Key (CIK), a unique identifier for U.S. firms, between the AAER database and the MD&A sections of 10-K filings. We extract the MD&A text from Item 7 of the 10-K annual reports. A search algorithm identifies the CIK so the MD&A text can be linked to the correct company. If the CIK is found in the AAER database, we label the firm as fraudulent (Fraud = 1); otherwise, it is labeled as non-fraudulent (Fraud = 0). We label all filings by a firm identified as fraudulent in a given year as fraudulent, under the assumption that fraudulent activity may span multiple years and may not be isolated to the filing flagged by regulators. While this may overestimate the number of fraudulent filings, it aligns with prior work suggesting that fraud is often persistent and difficult to detect in a single filing.
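A minimal sketch of this labeling step is shown below. It assumes pandas DataFrames with illustrative column names (`cik`, `year`, `mdna_text`) and toy values rather than our actual extraction pipeline.

```python
# Minimal sketch of labeling filings as fraudulent via a CIK match with the
# AAER database. DataFrames, column names, and values are illustrative.
import pandas as pd

filings = pd.DataFrame({
    "cik": [1111, 2222, 3333],
    "year": [2015, 2016, 2016],
    "mdna_text": ["...", "...", "..."],  # Item 7 (MD&A) text per filing
})
aaer = pd.DataFrame({"cik": [2222]})     # CIKs named in AAER enforcement actions

# Fraud = 1 for every filing whose CIK appears in the AAER database, else 0.
fraud_ciks = set(aaer["cik"])
filings["fraud"] = filings["cik"].isin(fraud_ciks).astype(int)
print(filings[["cik", "year", "fraud"]])
```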
The dataset includes a total of 123,415 filings from 1994 to 2021, with 5,785 identified as fraudulent, representing 4.66% of the total. Figure
Fraudulent vs. non-fraudulent 10-K filings per year. Note: The blue line represents the number of non-fraudulent filings per year, shown on the left vertical axis, while the red line represents the number of fraudulent filings per year, shown on the right vertical axis.
To understand the sample, we calculated the total number of words, sentences, and words per sentence in the MD&A sections. We found that the number of words per year in MD&A sections increased significantly over time. Before the year 2000, MD&A sections contained around 2,000 to 3,000 words per year, but after 2015 this range grew to 12,000 to 14,000 words, an increase of nearly 10,000 words over the last 15 years. This aligns with the findings of
On average, the MD&A sections in our dataset contain 8,854 words, 498 sentences, and 17 words per sentence. We observed that both the number of sentences and the average sentence length increased over time, with the rise in total sentences likely explained by the increase in total words. However, the increase in sentence length was an unexpected finding, potentially due to the inclusion of boilerplate information, as noted by
Further, Figure
To validate the findings of
To validate the model and test Hypothesis 1 while staying consistent with
To test Hypotheses 2 and 3, we use the word list from
To test Hypothesis 4, we use the Fog Index developed by
Separate OLS regressions are conducted for the Fog Index and the occurrence of word categories (positive, negative, uncertainty, and litigious) to test for significant differences. We compare trends between two periods, 1994–2007 and 2008–2021, to see if word category usage has shifted. The regression includes variables for baseline measures, whether the MD&A section is non-fraudulent, year, and an interaction term to assess how trends differ between fraudulent and non-fraudulent sections over time.
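The sketch below illustrates one such specification in Python with statsmodels. The simulated data and column names (`positive_pct`, `non_fraud`, `year`) are assumptions for illustration; in the study, one such regression is run per word category (and for the Fog Index) within each period.

```python
# Hedged sketch of an OLS regression with a trend interaction term.
# Data are simulated; column names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "positive_pct": rng.normal(1.0, 0.2, 200),  # share of positive words (%)
    "non_fraud": rng.integers(0, 2, 200),       # 1 = non-fraudulent MD&A
    "year": rng.integers(1994, 2008, 200),      # e.g., period T1: 1994-2007
})

# The interaction non_fraud:year tests whether the time trend differs
# between fraudulent and non-fraudulent MD&A sections.
model = smf.ols("positive_pct ~ non_fraud + year + non_fraud:year", data=df).fit()
print(model.params)
```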
We formulated Hypothesis 1 to assess whether a BERT model remains effective in detecting fraudulent activity in recent years, with accuracy comparable to or exceeding that of BM (2024). To gauge performance, we use their results as a benchmark, directly comparing our final model to their BERT models.
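For readers unfamiliar with the modeling step, the following sketch shows how a BERT classifier can be fine-tuned on labeled MD&A text and scored with the AUC metric reported in the table that follows. It assumes the Hugging Face transformers library and PyTorch; the model name, hyperparameters, and placeholder data are illustrative and do not reproduce the exact setup of the benchmark models or our final model.

```python
# Hedged sketch: fine-tune a BERT classifier on labeled MD&A text and
# compute an out-of-sample AUC. All data below are tiny placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer
from sklearn.metrics import roc_auc_score

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

train_texts = ["Revenue grew strongly and margins improved ...",
               "Impairment charges were recorded during the period ..."]
train_labels = [1, 0]  # 1 = fraudulent filing (AAER match), 0 = non-fraudulent

# BERT accepts at most 512 tokens, so long MD&A sections must be truncated
# (or split into chunks) before they are fed to the model.
enc = tokenizer(train_texts, truncation=True, max_length=512,
                padding="max_length", return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"],
                                  torch.tensor(train_labels)),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for input_ids, attention_mask, labels in loader:
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    optimizer.step()

# Score held-out filings: predicted fraud probabilities, then AUC.
test_texts = ["Litigation related to restated earnings is ongoing ...",
              "Cash flows from operations remained stable ..."]
test_labels = [1, 0]
model.eval()
with torch.no_grad():
    test_enc = tokenizer(test_texts, truncation=True, max_length=512,
                         padding="max_length", return_tensors="pt")
    fraud_prob = torch.softmax(model(**test_enc).logits, dim=-1)[:, 1]
print(roc_auc_score(test_labels, fraud_prob.numpy()))
```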
Table
AUC | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Average |
---|---|---|---|---|---|---|---|---|---|
BERTfirst | 0.738 | 0.892 | 0.846 | 0.878 | 0.824 | 0.878 | 0.823 | 0.869 | 0.844 |
BERTlast | 0.521 | 0.665 | 0.711 | 0.636 | 0.648 | 0.658 | 0.736 | 0.711 | 0.661 |
BERTfinal | 0.657 | 0.825 | 0.842 | 0.814 | 0.771 | 0.823 | 0.817 | 0.827 | 0.797 |
We tested Hypothesis 1 by comparing our BERT models to the benchmark models of
Hypothesis 2 postulates that firms engaged in fraudulent activities use a different frequency of positive, negative, litigious, and uncertainty words in their MD&A sections compared to non-fraudulent firms. To test this, we examine the occurrences of these specific word categories in both fraudulent and non-fraudulent MD&A sections, aiming to determine whether linguistic characteristics differ between the two groups. Figure
Percentage of word occurrences in MD&A sections for fraudulent vs. non-fraudulent companies. Note: The y-axis represents the average yearly percentage of each word category occurring in the MD&A sections. The word categories (Positive, Negative, Uncertainty, and Litigious) are based on the classifications by
We also conduct Ordinary Least Squares (OLS) regression analyses to test for significant differences in word category usage between fraudulent and non-fraudulent MD&A sections over the period 1994–2021. The results show that fraudulent MD&A sections contain significantly more positive words, with the difference being statistically significant at the 99% level. This finding aligns with the suggestion of
However, the analysis reveals no significant differences in the usage of negative, uncertainty, and litigious words between fraudulent and non-fraudulent firms at the 95% significance level, leading us to partially reject H2 for these categories. There is some indication, at the 90% significance level, that non-fraudulent MD&A sections contain fewer negative words.
In Hypothesis 3, we propose that the frequency of word categories in non-fraudulent filings is more consistent over time compared to fraudulent filings. To test this, we examine whether the trend in the frequency of word categories in MD&A sections differs between non-fraudulent and fraudulent firms.
Interestingly, Figure
Trends in word usage for positive, negative, uncertainty, and litigious words across two time periods. Note: The figure presents scatter plots and trend lines for the usage of positive, negative, uncertainty, and litigious words in MD&A sections across two periods: T1 (1994–2007) shown in red, and T2 (2008–2021) shown in blue. It provides a visual comparison of word usage trends over time for each category, illustrating the difference in slopes between the two periods.
Our regression analysis (not tabulated for brevity) examined word category trends (positive, negative, uncertainty, and litigious) between two periods (T1 and T2), as shown in Figure
The results show a significant decline in the use of positive words in both fraudulent and non-fraudulent MD&A during T1. However, in T2, the decline in positive word usage was less steep for fraudulent MD&A, indicating some stabilization, while the trend remained insignificant for non-fraudulent filings.
Negative word usage was significantly higher during T1 for both groups. Interestingly, there was a significant reduction in the use of negative words during T2, suggesting a shift in language over time.
Both fraudulent and non-fraudulent MD&A showed an increase in uncertainty words during T1, but this trend reversed to a decline in T2. Litigious word usage showed an insignificant positive trend in fraudulent MD&A during T1, which shifted to a negative trend in T2. This suggests that fraudulent filings may be using less legalistic language over time, while the results for non-fraudulent filings were not significant.
In Hypothesis 4, we propose that firms engaged in fraudulent activities have a higher Fog Index, indicating more complex and difficult-to-read MD&A sections compared to non-fraudulent firms. To test this, we calculated the average Gunning Fog Index for the MD&A sections, with the results visualized in Figure
Average Fog Index per year for fraudulent vs. non-fraudulent MD&A sections. Note: The y-axis displays the Fog Index, which measures the readability of the text. The blue solid line represents the average Fog Index for non-fraudulent MD&A sections per year, while the red solid line represents the average Fog Index for fraudulent MD&A sections per year.
We also conducted a regression analysis to further explore the relationship between the Fog Index and fraudulent vs. non-fraudulent MD&A sections over time. While Figure
Our research explores the use of AI, specifically BERT models, to detect fraud through textual analysis of MD&A sections in 10-K filings. The study tests four hypotheses on the effectiveness of BERT models and linguistic indicators of fraudulent activity.
Hypothesis 1 confirmed that BERT models remain effective in detecting fraud, replicating the findings of
Hypothesis 2 tested whether fraudulent MD&A sections use a higher frequency of positive, litigious, and uncertainty words, and fewer negative words. The results supported the hypothesis for positive words, consistent with
Hypothesis 3 explored whether non-fraudulent MD&A sections show more consistent trends in word categories over time compared to fraudulent sections. While we found some evidence of a decreasing trend in litigious word usage in non-fraudulent filings, the trend lines for both fraudulent and non-fraudulent sections became more stable in recent years. This stabilization could reflect the increasing use of AI-generated reports, as suggested by
Hypothesis 4 examined whether fraudulent MD&A sections are less readable, as measured by the Fog Index. Contrary to the hypothesis, non-fraudulent MD&As had a slightly higher Fog Index, although the difference was not statistically significant. This finding contrasts with
A limitation of this study, noted by
In summary, this study validates and extends the findings of
F.H.E. Ketelaar MSc – Florian
Dr. A. Mićković – Ana is an Assistant Professor of Accounting at the University of Amsterdam.
Florian Ketelaar is one of the winners of the MAB Thesis Award 2024. This article is based on his master thesis.