Scraping the SEC

All companies publicly traded in the United States must file an annual financial statement (the “10-K”) and three quarterly financial statements (the “10-Q”; the fourth quarter is covered by the 10-K). These documents are pored over by equity research analysts, accountants, and the like to gain an understanding of a company’s financial situation. They are also an incredibly rich source of natural language data for machine learning models, but the primary obstacle to leveraging them is that the forms are only available as raw .html or .txt files. Complicating matters, many companies structure their statements their own way, or they embed XBRL tables, PDF sections, or images that come through the parser as garbled text.
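To make the obstacle concrete, here is a minimal sketch in R (the language the original project used) of pulling one quarter’s form index from EDGAR and flattening a single filing to plain text. The index URL pattern and the User-Agent requirement are genuine EDGAR conventions; the contact address, the regex-based index parsing, and the absence of rate limiting or error handling are simplifications for illustration.

```r
library(httr)
library(rvest)

# EDGAR's fair-access policy asks for a user agent with contact info
ua <- user_agent("research-project contact@example.com")

# form.idx lists every filing for the quarter: form type, company, CIK, date, path
idx_url <- "https://www.sec.gov/Archives/edgar/full-index/2020/QTR1/form.idx"
idx_raw <- content(GET(idx_url, ua), as = "text", encoding = "UTF-8")

# The index is fixed-width text; keep only 10-K rows (a sturdier parser
# would read the column offsets from the header instead of regex-matching)
lines_10k <- grep("^10-K ", strsplit(idx_raw, "\n")[[1]], value = TRUE)

# Each row ends with a path like edgar/data/320193/0000320193-20-000010.txt
paths <- sub(".*(edgar/data/\\S+)$", "\\1", trimws(lines_10k))

# Fetch one filing and strip the markup down to text -- this is exactly
# where embedded XBRL and PDF segments surface as garbled noise
doc_url  <- paste0("https://www.sec.gov/Archives/", paths[1])
doc_html <- content(GET(doc_url, ua), as = "text", encoding = "UTF-8")
doc_text <- html_text2(read_html(doc_html))
```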

For one of my research projects, I scraped the SEC’s entire database of 10-Qs and 10-Ks and segmented the texts into their respective sections (for example, Item 3 covers Legal Proceedings and Item 6 contains tables of financial metrics). I open-sourced both the code and the data, and both are unusually popular: the code has over 40 stars and nearly 30 forks, while the dataset has over 440 bookmarks and is one of the most popular on the data.world platform. I suspect this is because (a) I am the only one silly enough to use R for this kind of work and (b) everyone else who has actually scraped the SEC either kept their data private or gated it behind a pay-for-use API.
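The segmentation step mostly comes down to locating the “Item N.” headers and slicing the text between them. Below is a rough, hypothetical sketch of that idea; real filings need far more care (the table of contents repeats every header, casing is inconsistent, and subitems like Item 1A complicate the pattern), so treat this as illustrative rather than as the exact logic in my repository.

```r
split_items <- function(doc_text) {
  # Locate headers like "Item 3." or "ITEM 1A -" at the start of a line
  hdr <- gregexpr("(?mi)^\\s*item\\s+\\d{1,2}[ab]?\\s*[.:-]",
                  doc_text, perl = TRUE)[[1]]
  if (hdr[1] == -1) return(list(full_text = doc_text))

  ml     <- attr(hdr, "match.length")
  starts <- as.integer(hdr)
  ends   <- c(starts[-1] - 1L, nchar(doc_text))

  # Slice the document at each header and name each chunk by its header
  sections <- substring(doc_text, starts, ends)
  names(sections) <- trimws(substring(doc_text, starts, starts + ml - 1L))
  as.list(sections)
}

sections <- split_items(doc_text)
names(sections)  # e.g. "Item 1.", "Item 1A", "Item 3.", ...
```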

In 2021 there are much better ways to scrape the SEC (e.g. the python-edgar package), and other people host cleaned text files (e.g. Notre Dame), so I don’t see much use in updating my own codebase anymore. Nonetheless, the lack of open access to this incredibly valuable set of natural language data is a little disappointing, and I think there is still a huge opportunity for someone to link the text to other important financial data sources (the most obvious being stock price data). I definitely would not mind returning to it when I have time!
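As a taste of what that linkage might look like, here is a tiny sketch: given a filing date from the scraped metadata, pull daily prices around it with quantmod (Yahoo Finance backend) and compute the return over the following two weeks. The ticker, date, and window are made up purely for the example.

```r
library(quantmod)

filing_date <- as.Date("2020-01-29")   # hypothetical 10-K filing date
px <- getSymbols("AAPL", src = "yahoo", auto.assign = FALSE,
                 from = filing_date, to = filing_date + 14)

cl  <- Cl(px)                                             # daily closes
ret <- as.numeric(last(cl)) / as.numeric(first(cl)) - 1   # post-filing return
```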