Three asset classes, 38,880 Parquet files, and ten years of loan-level data from the SEC.

structured-finance
abs-ee
data-lake
infrastructure
A guided tour of the raw data behind every analysis on this site: what we have, where it comes from, and how SEC XML filings become queryable columns.
Published June 7, 2026

Every post on this site is built from the same source: loan-level data filed directly with the SEC. Under Regulation AB II, issuers of publicly registered ABS are required to file monthly asset-level data through Form ABS-EE: every loan, every trust, every reporting period, in structured XML. It’s been a legal requirement since roughly 2016. The data exists, it’s public, and EDGAR serves it for free.

The catch is that “publicly available” and “usable at scale” are different things. The filings are XML, the schema drifts by trust and vintage, and parsing 38,000 files is enough friction that most analysts don’t bother. This site does. Below is what that looks like.


The pipeline

From SEC filing to queryable column is a four-step process. The ingest side runs on a schedule; the analysis side runs on demand when building a post.

The lake is Hive-partitioned by asset class, reporting period, and trust — so a single query pattern covers a full decade of filings without any local storage. Pre-computed aggregates power the charts; the raw lake is never queried at page-render time.
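As a rough illustration of what Hive partitioning means in practice, the sketch below builds and parses `key=value` directory paths. The partition key names (`asset_class`, `period`, `trust`), the root name, and the example values are assumptions made for illustration; the post only states which three dimensions the lake is partitioned by, not how the keys are spelled.

```python
from pathlib import PurePosixPath

# Assumed key names: the post names the dimensions, not the exact keys.
PARTITION_KEYS = ("asset_class", "period", "trust")


def partition_path(root: str, asset_class: str, period: str, trust: str) -> str:
    """Build a Hive-style directory path: root/key=value/key=value/..."""
    values = (asset_class, period, trust)
    parts = [f"{key}={value}" for key, value in zip(PARTITION_KEYS, values)]
    return str(PurePosixPath(root, *parts))


def parse_partition(path: str) -> dict:
    """Recover partition values from the key=value segments of a path."""
    found = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key in PARTITION_KEYS:
            found[key] = value
    return found


# Hypothetical example values.
example = partition_path("lake", "autoloan", "2026-03", "example-trust")
```

Because the partition values live in the path itself, a query engine can skip whole directories when a filter names an asset class, period, or trust, which is what lets one query pattern span the full decade.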


Coverage

The mandate had a slow start. Early filings were sparse: three auto loan trusts in January 2017 and a handful of CMBS. That’s not a data quality problem; it’s the SEC enforcement curve. Trusts took a year or two to get compliant, and the regulator wasn’t aggressive about it immediately. By mid-2018 the universe had stabilized.

Three asset classes, three distinct profiles. Auto loan has the largest trust count: it peaked around 250 and held there. CMBS launched earliest (November 2016), and the serrated pattern in its trust count is real: commercial deals mature and roll off the filing roster regularly, so the count swings month to month as new deals come on and old ones pay off. Auto lease is the thinnest shelf: fewer originators securitize leases, and it shows.


Auto Loan

The auto loan asset class is the richest in the lake. Each monthly record captures the full origination snapshot — FICO score, vehicle make and value, loan amount, rate, term — plus the current payment status for that reporting month. Active loans (zeroBalanceCode IS NULL) are still in the pool; everything else has paid off, charged off, or been repurchased.
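The active-versus-closed split can be sketched with an in-memory SQLite table. The table name, the `assetNumber` column, and the rows are made up for illustration, and the non-null code values are placeholders rather than documented meanings; only the `zeroBalanceCode IS NULL` test comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE autoloan ("
    "assetNumber TEXT, zeroBalanceCode TEXT, chargedOffAmount REAL)"
)
# Illustrative rows only: a NULL zeroBalanceCode means the loan is still active.
conn.executemany(
    "INSERT INTO autoloan VALUES (?, ?, ?)",
    [
        ("A-001", None, 0.0),
        ("A-002", "1", 0.0),
        ("A-003", None, 0.0),
        ("A-004", "3", 4200.0),
    ],
)


def count_by_status(connection: sqlite3.Connection) -> dict:
    """Count active loans (NULL code) and closed loans (any non-NULL code)."""
    row = connection.execute(
        "SELECT "
        "SUM(CASE WHEN zeroBalanceCode IS NULL THEN 1 ELSE 0 END), "
        "SUM(CASE WHEN zeroBalanceCode IS NOT NULL THEN 1 ELSE 0 END) "
        "FROM autoloan"
    ).fetchone()
    return {"active": row[0], "closed": row[1]}


status_counts = count_by_status(conn)
```

Note that `IS NULL` is required here: a comparison such as `zeroBalanceCode = NULL` never matches in SQL.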

One sample loan from the March 2026 universe — a middle-shelf trust by filing count:

A 673 FICO Dodge Charger at 1.21× LTV on a 72-month term. Not a disaster. A 1.9% rate means the loan was either subvented by the manufacturer or this is a very old loan near payoff. Probably the former: the subvented field is Yes, which is why the rate looks wrong for that credit score.
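For readers unfamiliar with the LTV figure, here is a minimal sketch of how such a ratio could be derived from two of the fields listed below. It assumes LTV is `originalLoanAmount` divided by `vehicleValueAmount`; the post does not spell out its formula, and the dollar amounts are hypothetical values chosen only to reproduce a 1.21× ratio.

```python
def loan_to_value(original_loan_amount: float, vehicle_value_amount: float) -> float:
    """Loan-to-value as a multiple; above 1.0 means the loan exceeds the car's value."""
    if vehicle_value_amount <= 0:
        raise ValueError("vehicle_value_amount must be positive")
    return original_loan_amount / vehicle_value_amount


# Hypothetical amounts, not the sample loan's actual figures.
ltv = loan_to_value(36_300.0, 30_000.0)
print(f"{ltv:.2f}x LTV")
```

An LTV above 1.0 typically reflects financed extras rolled into the loan, such as taxes, fees, or negative equity from a trade-in.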

Key fields: obligorCreditScore, vehicleManufacturerName, vehicleModelName, vehicleModelYear, vehicleValueAmount, originalLoanAmount, originalInterestRatePercentage, originalLoanTerm, currentDelinquencyStatus, zeroBalanceCode, chargedOffAmount, paymentToIncomePercentage, obligorGeographicLocation, subvented, underwritingIndicator


Auto Lease

Auto leases are structurally similar to auto loans but with a critical addition: residual value. At the end of a lease, the lessee either returns the car or buys it at the contractResidualValue. The gap between vehicleValueAmount and contractResidualValue is what the trust is betting on — if used car prices collapse, so does the residual.

That bet is in the data. baseResidualValue is the unsubsidized estimate; contractResidualValue is what was promised to the lessee. The spread between them is the cost of any manufacturer subvention. Filed monthly, for every lease in every pool.

The World Omni/Toyota relationship is interesting here. Acquisition cost ($51,639) is below vehicle value ($54,404) — the trust paid less than the car is worth, or the value estimate is stale. The residual gap is $3,212 — World Omni is paying $3k to lower the lessee’s end-of-lease buy price. That’s a subsidy to move Toyotas off lots, and it shows up as a direct cost in the trust.
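The arithmetic behind those two gaps can be sketched as follows. The acquisition cost and vehicle value are the figures quoted above. The two residual values are hypothetical, chosen only so their difference matches the quoted $3,212, because the post does not give the individual residuals or say which one is larger.

```python
def value_minus_acquisition(vehicle_value: int, acquisition_cost: int) -> int:
    """How far the reported vehicle value exceeds what the trust paid."""
    return vehicle_value - acquisition_cost


def residual_gap(base_residual: int, contract_residual: int) -> int:
    """Absolute spread between base and contract residual values."""
    return abs(contract_residual - base_residual)


# Figures quoted in the post for the sample lease.
discount = value_minus_acquisition(54_404, 51_639)

# Hypothetical residuals; only the size of the gap comes from the post.
gap = residual_gap(base_residual=30_000, contract_residual=33_212)

print(discount, gap)
```

The absolute value is used deliberately so the sketch makes no claim about the direction of the subsidy.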

Key fields: lesseeCreditScore, lesseeCreditScoreType, vehicleManufacturerName, vehicleValueAmount, acquisitionCost, contractResidualValue, baseResidualValue, originalLeaseTermNumber, scheduledTerminationDate, currentDelinquencyStatus, zeroBalanceCode, reportingPeriodSecuritizationValueAmount


CMBS

CMBS is a different world. No borrower FICO, no vehicle. The collateral is commercial real estate — office buildings, hotels, warehouses — and the schema reflects it. Structural complexity dominates: interest-only periods, balloon maturities, prepayment premiums, workout strategies, ARM rate reset mechanics. 106 fields per loan because commercial lending is complicated by design.

The loan count is more than two orders of magnitude smaller than auto (15k vs 5.4M, roughly 360× fewer), but the individual loans are much larger. That Deutsche Bank sample is $85M on a single note backed by 68 properties. Those properties have since become 51, and that attrition is filed monthly too.

3.55%, full-term interest-only, balloon maturity in October 2026. That balloon is a few months away. Either the borrower refinances, the trust extends, or this loan becomes a workout situation — and if the latter, there’s a workoutStrategyCode field that will tell you exactly what strategy the special servicer is pursuing.
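To put rough numbers on that structure, the sketch below estimates the monthly interest-only payment and the months left before the balloon. It assumes a flat $85M balance and a simple rate/12 monthly accrual; the actual day-count convention is not stated in the post, so the true payment may differ. The months calculation counts from the June 2026 publication date.

```python
def monthly_interest_only_payment(balance: float, annual_rate_pct: float) -> float:
    """Interest due each month with no principal amortization (rate/12 accrual assumed)."""
    return balance * (annual_rate_pct / 100) / 12


def months_between(start_year: int, start_month: int, end_year: int, end_month: int) -> int:
    """Whole calendar months from the start month to the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month)


balance = 85_000_000  # approximate note size quoted in the post
payment = monthly_interest_only_payment(balance, 3.55)
months_left = months_between(2026, 6, 2026, 10)

# Full-term interest-only means no principal is paid down before maturity,
# so the balloon due equals the full balance.
balloon_due = balance

print(round(payment, 2), months_left, balloon_due)
```

Under these assumptions the borrower pays about $251k a month in interest and still owes the entire principal at maturity, which is why the refinance-or-workout question matters so much.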

Key fields: originalLoanAmount, originalInterestRatePercentage, originalTermLoanNumber, maturityDate, interestOnlyIndicator, balloonIndicator, loanStructureCode, paymentStatusLoanCode, workoutStrategyCode, nonRecoverabilityIndicator, NumberPropertiesSecuritization, NumberProperties, realizedLossToTrustAmount


Schema comparison

The three asset classes share a pipeline skeleton but diverge sharply in what they measure. Auto loan and lease are borrower-centric — credit score, income, the car. CMBS drops borrower credit entirely and adds structural complexity that consumer lending doesn’t have: ARM mechanics, workout codes, distress tracking, property counts.
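One quick way to see the divergence is to intersect the key-field lists given in the three sections above. This sketch covers only those listed key fields, not the full schemas, so it understates whatever overlap exists in the complete filings.

```python
AUTO_LOAN = {
    "obligorCreditScore", "vehicleManufacturerName", "vehicleModelName",
    "vehicleModelYear", "vehicleValueAmount", "originalLoanAmount",
    "originalInterestRatePercentage", "originalLoanTerm",
    "currentDelinquencyStatus", "zeroBalanceCode", "chargedOffAmount",
    "paymentToIncomePercentage", "obligorGeographicLocation", "subvented",
    "underwritingIndicator",
}
AUTO_LEASE = {
    "lesseeCreditScore", "lesseeCreditScoreType", "vehicleManufacturerName",
    "vehicleValueAmount", "acquisitionCost", "contractResidualValue",
    "baseResidualValue", "originalLeaseTermNumber", "scheduledTerminationDate",
    "currentDelinquencyStatus", "zeroBalanceCode",
    "reportingPeriodSecuritizationValueAmount",
}
CMBS = {
    "originalLoanAmount", "originalInterestRatePercentage",
    "originalTermLoanNumber", "maturityDate", "interestOnlyIndicator",
    "balloonIndicator", "loanStructureCode", "paymentStatusLoanCode",
    "workoutStrategyCode", "nonRecoverabilityIndicator",
    "NumberPropertiesSecuritization", "NumberProperties",
    "realizedLossToTrustAmount",
}

loan_and_lease = AUTO_LOAN & AUTO_LEASE      # vehicle and status fields
loan_and_cmbs = AUTO_LOAN & CMBS             # original loan terms only
lease_and_cmbs = AUTO_LEASE & CMBS           # nothing in common
all_three = AUTO_LOAN & AUTO_LEASE & CMBS    # nothing in common

print(sorted(loan_and_lease))
print(sorted(loan_and_cmbs))
```

Among the listed key fields, loan and lease share the vehicle and status columns, loan and CMBS share only the original amount and rate, and no field appears in all three.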


Interested in the data?

The ingestion pipeline that built this lake — EDGAR scraping at scale, XML schema normalization across trust vintages, Parquet conversion, partition management — lives in a private repository. The effort to get here was non-trivial. If you’re a researcher, institution, or potential collaborator interested in the data or the pipeline, reach out: [email protected]