---
title: "Source Material"
subtitle: "Three asset classes, 38,880 Parquet files, and ten years of loan-level data from the SEC."
description: "A guided tour of the raw data behind every analysis on this site: what we have, where it comes from, and how SEC XML filings become queryable columns."
date: 2026-06-07
categories: [structured-finance, abs-ee, data-lake, infrastructure]
jupyter: python3
freeze: false
execute:
echo: false
warning: false
---
```{python}
from pathlib import Path
import pandas as pd
OUTPUTS = Path("../../analysis/outputs")
coverage = pd.read_parquet(OUTPUTS / "lake_coverage.parquet")
# Clip obviously-partial trailing months (fewer than 10 trusts where prior month had 30+)
def clip_partial(df, asset):
sub = df[df.asset_class == asset].sort_values("reporting_period")
if len(sub) < 2:
return sub
last = sub.iloc[-1]
prev = sub.iloc[-2]
if last.trust_count < prev.trust_count * 0.4:
return sub.iloc[:-1]
return sub
coverage_clean = pd.concat([
clip_partial(coverage, a) for a in ["autoloan", "autolease", "cmbs"]
])
# Hardcoded schema matrix (field counts by category and asset class)
CATEGORIES = [
"Borrower / lessee credit",
"Collateral",
"Loan / lease terms",
"Monthly performance",
"Servicing & fees",
"Loan structure",
"Distress / workout",
"Rate adjustment (ARM)",
"Pipeline metadata",
]
schema_matrix = [
# Auto Loan
{"asset": "Auto Loan", "category": "Borrower / lessee credit", "fields": 6},
{"asset": "Auto Loan", "category": "Collateral", "fields": 7},
{"asset": "Auto Loan", "category": "Loan / lease terms", "fields": 9},
{"asset": "Auto Loan", "category": "Monthly performance", "fields": 12},
{"asset": "Auto Loan", "category": "Servicing & fees", "fields": 8},
{"asset": "Auto Loan", "category": "Loan structure", "fields": 3},
{"asset": "Auto Loan", "category": "Distress / workout", "fields": 1},
{"asset": "Auto Loan", "category": "Rate adjustment (ARM)", "fields": 0},
{"asset": "Auto Loan", "category": "Pipeline metadata", "fields": 9},
# Auto Lease
{"asset": "Auto Lease", "category": "Borrower / lessee credit", "fields": 7},
{"asset": "Auto Lease", "category": "Collateral", "fields": 7},
{"asset": "Auto Lease", "category": "Loan / lease terms", "fields": 9},
{"asset": "Auto Lease", "category": "Monthly performance", "fields": 12},
{"asset": "Auto Lease", "category": "Servicing & fees", "fields": 8},
{"asset": "Auto Lease", "category": "Loan structure", "fields": 4},
{"asset": "Auto Lease", "category": "Distress / workout", "fields": 1},
{"asset": "Auto Lease", "category": "Rate adjustment (ARM)", "fields": 0},
{"asset": "Auto Lease", "category": "Pipeline metadata", "fields": 9},
# CMBS
{"asset": "CMBS", "category": "Borrower / lessee credit", "fields": 0},
{"asset": "CMBS", "category": "Collateral", "fields": 2},
{"asset": "CMBS", "category": "Loan / lease terms", "fields": 12},
{"asset": "CMBS", "category": "Monthly performance", "fields": 14},
{"asset": "CMBS", "category": "Servicing & fees", "fields": 6},
{"asset": "CMBS", "category": "Loan structure", "fields": 9},
{"asset": "CMBS", "category": "Distress / workout", "fields": 9},
{"asset": "CMBS", "category": "Rate adjustment (ARM)", "fields": 10},
{"asset": "CMBS", "category": "Pipeline metadata", "fields": 9},
]
# Sample records — one per asset class, curated fields only
autoloan_sample = [
{"field": "Sponsor", "value": "Santander Consumer USA"},
{"field": "Trust", "value": "Santander Drive ART"},
{"field": "FICO at origination", "value": "673 (Bureau)"},
{"field": "PTI ratio", "value": "14.1%"},
{"field": "Income verification", "value": "Level 2"},
{"field": "Vehicle", "value": "2017 Dodge Charger (New)"},
{"field": "Vehicle value", "value": "$43,738"},
{"field": "Original balance", "value": "$52,827"},
{"field": "Rate", "value": "1.90% (fixed)"},
{"field": "Term", "value": "72 months"},
{"field": "LTV at origination", "value": "1.21×"},
{"field": "Subvented", "value": "Yes"},
]
autolease_sample = [
{"field": "Sponsor", "value": "World Omni Financial Corp"},
{"field": "FICO at origination", "value": "641"},
{"field": "PTI ratio", "value": "15.95%"},
{"field": "Lessee location", "value": "NC"},
{"field": "Vehicle", "value": "2024 Toyota 4Runner (New, SUV)"},
{"field": "Vehicle value", "value": "$54,404"},
{"field": "Acquisition cost", "value": "$51,639"},
{"field": "Contract residual", "value": "$31,020"},
{"field": "Base residual", "value": "$27,808"},
{"field": "Lease term", "value": "48 months"},
{"field": "Sched. termination", "value": "April 2028"},
{"field": "Subvented", "value": "No"},
]
cmbs_sample = [
{"field": "Originator", "value": "Deutsche Bank AG (NY)"},
{"field": "Origination date", "value": "2020-05-13"},
{"field": "Maturity date", "value": "2026-10-09"},
{"field": "Original balance", "value": "$85,000,000"},
{"field": "Rate", "value": "3.55% (fixed)"},
{"field": "Loan term", "value": "77 months"},
{"field": "Lien position", "value": "1st"},
{"field": "Loan structure", "value": "A2"},
{"field": "Interest-only", "value": "Yes (full-term IO)"},
{"field": "Balloon", "value": "Yes"},
{"field": "P&I payment", "value": "$259,840 / period"},
{"field": "Properties (orig.)", "value": "68 → 51 current"},
]
ojs_define(
coverage=coverage_clean.to_dict(orient="records"),
schema_matrix=schema_matrix,
categories=CATEGORIES,
autoloan_sample=autoloan_sample,
autolease_sample=autolease_sample,
cmbs_sample=cmbs_sample,
)
```
Every post on this site is built from the same source: loan-level data filed directly with the SEC.
Under Regulation AB II, issuers of publicly-registered ABS are required to file monthly asset-level
data through Form ABS-EE — every loan, every trust, every reporting period, in structured XML.
It's been a legal requirement since roughly 2016. The data exists, it's public, and EDGAR serves it for free.
The catch is that "publicly available" and "usable at scale" are different things. The filings are
XML, the schema drifts by trust and vintage, and parsing 38,000 files is enough friction that most
analysts don't bother. This site does. Below is what that looks like.
---
## The pipeline
From SEC filing to queryable column is a four-step process. The ingest side runs on a schedule;
the analysis side runs on demand when building a post.
```{ojs}
//| echo: false
{
const details = [
["Form ABS-EE", "EX-102 XML", "~2,400 filings/qtr"],
["XML → Parquet", "Hive partitioning", "abs_lake module"],
["Private Parquet lake", "38,880 files", "Hive-partitioned by asset class, period, trust"],
["SQL query layer", "Serverless S3 reads", "Schema-adaptive merge"],
["SQL aggregations", "aggregated Parquet", "ojs_define() → charts"],
]
const titles = ["SEC EDGAR", "Python pipeline", "Cloudflare R2", "DuckDB", "Analysis layer"]
const stagesHtml = titles.map((title, i) => {
const hl = i === 2
const border = hl ? "#4A6FA5" : "#D8D3C8"
const bg = hl ? "rgba(74,111,165,0.06)" : "rgba(255,255,255,0.25)"
const detailsHtml = details[i].map(d =>
`<div style="color:#8A8A8A;font-family:'DM Mono',monospace;font-size:0.7rem;line-height:1.6;">${d}</div>`
).join("")
const box = `<div style="flex:1;min-width:100px;border:1px solid ${border};border-radius:4px;padding:0.9rem 0.75rem;text-align:center;background:${bg};">
<div style="font-family:'DM Serif Display',serif;font-size:0.9rem;color:#1C1C1E;margin-bottom:0.5rem;">${title}</div>
${detailsHtml}
</div>`
const arrow = i < 4
? `<div style="display:flex;align-items:center;padding:0 4px;margin-top:2.2rem;color:#AAAAAA;font-size:0.9rem;flex-shrink:0;">→</div>`
: ""
return box + arrow
}).join("")
const div = document.createElement("div")
div.innerHTML = `<div style="display:flex;align-items:flex-start;margin:2rem 0;overflow-x:auto;">${stagesHtml}</div><div class="chart-caption">Ingest pipeline runs on a cron schedule. Analysis scripts run at render time against the R2 bucket.</div>`
return div
}
```
The lake is Hive-partitioned by asset class, reporting period, and trust — so a single query pattern
covers a full decade of filings without any local storage. Pre-computed aggregates power the charts;
the raw lake is never queried at page-render time.
---
## Coverage
The mandate had a slow start. Early filings were sparse — three autoloan trusts in January 2017,
a handful of CMBS. That's not a data quality problem; it's the SEC enforcement curve. Trusts took
a year or two to get compliant, and the regulator wasn't aggressive about it immediately. By
mid-2018 the universe had stabilized.
```{ojs}
//| echo: false
//| output: false
d3_inv = require("d3@7")
ASSET_COLOR = ({
"autoloan": "#4A6FA5",
"autolease": "#5A9A82",
"cmbs": "#C89050",
})
ASSET_LABEL = ({
"autoloan": "Auto Loan",
"autolease": "Auto Lease",
"cmbs": "CMBS",
})
```
```{ojs}
//| echo: false
{
const chart = Plot.plot({
marginLeft: 40,
marginRight: 100,
marginTop: 40,
height: 260,
x: {
label: null,
type: "utc",
tickFormat: d3_inv.utcFormat("%Y"),
},
y: {
label: "Trusts reporting",
grid: true,
},
style: { background: "transparent", fontSize: "12px" },
marks: [
...["autoloan", "autolease", "cmbs"].flatMap(asset => [
Plot.areaY(coverage.filter(d => d.asset_class === asset), {
x: d => new Date(d.reporting_period + "-01"),
y: "trust_count",
fill: ASSET_COLOR[asset],
fillOpacity: 0.08,
curve: "monotone-x",
}),
Plot.lineY(coverage.filter(d => d.asset_class === asset), {
x: d => new Date(d.reporting_period + "-01"),
y: "trust_count",
stroke: ASSET_COLOR[asset],
strokeWidth: 1.8,
curve: "monotone-x",
}),
]),
// Right-side labels
...["autoloan", "autolease", "cmbs"].map(asset => {
const last = coverage.filter(d => d.asset_class === asset).at(-1)
return Plot.text([last], {
x: d => new Date(d.reporting_period + "-01"),
y: "trust_count",
text: () => ASSET_LABEL[asset],
dx: 8,
fontSize: 10.5,
fill: ASSET_COLOR[asset],
textAnchor: "start",
})
}),
]
})
return html`<div style="margin:1.5rem 0">${chart}</div>
<div class="chart-caption">Trusts filing per reporting period. Trailing partial months excluded. Source: SEC EDGAR Form ABS-EE.</div>`
}
```
Three asset classes, three distinct profiles. Auto loan has the largest trust count — peaked around
250 and held there. CMBS launched earliest (November 2016) but that serrated pattern is real:
commercial deals mature and roll off the filing roster regularly, so the count swings month to month
as new deals come on and old ones pay off. Auto lease is the thinnest shelf: fewer originators
securitize leases, and it shows.
---
## Auto Loan
```{ojs}
//| echo: false
{
const stats = [
["Coverage", "Jan 2017 – Apr 2026"],
["Reporting periods", "101"],
["Peak trust count", "247 trusts"],
["Active loans (Apr 2026)", "~5.4M"],
["Parquet files", "23,289"],
["Fields per loan", "~70"],
]
const div = document.createElement("div")
div.style.cssText = "display:grid;grid-template-columns:1fr 1fr 1fr;gap:1rem;margin:1.5rem 0;font-family:'DM Mono',monospace;font-size:0.75rem;"
div.innerHTML = stats.map(([label, val]) =>
`<div style="border:1px solid #D8D3C8;border-radius:4px;padding:0.75rem;background:rgba(255,255,255,0.2);">
<div style="color:#AAAAAA;font-size:0.65rem;letter-spacing:0.06em;text-transform:uppercase;margin-bottom:0.3rem;">${label}</div>
<div style="color:#2C2C2C;font-size:0.85rem;">${val}</div>
</div>`
).join("")
return div
}
```
The auto loan asset class is the richest in the lake. Each monthly record captures the full
origination snapshot — FICO score, vehicle make and value, loan amount, rate, term — plus the
current payment status for that reporting month. Active loans (`zeroBalanceCode IS NULL`) are still
in the pool; everything else has paid off, charged off, or been repurchased.
One sample loan from the March 2026 universe — a middle-shelf trust by filing count:
```{ojs}
//| echo: false
{
const accentColor = "#4A6FA5"
return html`<div style="
border-left: 3px solid ${accentColor};
padding: 0.1rem 0 0.1rem 1.25rem;
margin: 1.5rem 0;
">
<div style="font-family:'DM Mono',monospace;font-size:0.68rem;letter-spacing:0.06em;text-transform:uppercase;color:#AAAAAA;margin-bottom:0.75rem;">Auto Loan — sample record, 2026-03</div>
<table style="border-collapse:collapse;width:100%;max-width:520px;">
${autoloan_sample.map((row, i) => html`<tr style="border-bottom:1px solid #E8E3D8;">
<td style="font-family:'DM Mono',monospace;font-size:0.75rem;color:#8A8A8A;padding:0.35rem 1rem 0.35rem 0;white-space:nowrap;vertical-align:top;">${row.field}</td>
<td style="font-family:'DM Mono',monospace;font-size:0.78rem;color:#2C2C2C;padding:0.35rem 0;">${row.value}</td>
</tr>`)}
</table>
</div>`
}
```
A 673 FICO Dodge Charger at 1.21× LTV on a 72-month term. Not a disaster — a 1.9% rate means
it was either subvented by the manufacturer or this is a very old loan near payoff. Probably the
former. That field (`subvented = Yes`) is why the number looks wrong.
**Key fields:** `obligorCreditScore`, `vehicleManufacturerName`, `vehicleModelName`, `vehicleModelYear`,
`vehicleValueAmount`, `originalLoanAmount`, `originalInterestRatePercentage`, `originalLoanTerm`,
`currentDelinquencyStatus`, `zeroBalanceCode`, `chargedOffAmount`, `paymentToIncomePercentage`,
`obligorGeographicLocation`, `subvented`, `underwritingIndicator`
---
## Auto Lease
```{ojs}
//| echo: false
{
const stats = [
["Coverage", "Feb 2017 – Jun 2026"],
["Reporting periods", "113"],
["Peak trust count", "32 trusts"],
["Active leases (Apr 2026)", "~1.3M"],
["Parquet files", "3,618"],
["Fields per lease", "72"],
]
const div = document.createElement("div")
div.style.cssText = "display:grid;grid-template-columns:1fr 1fr 1fr;gap:1rem;margin:1.5rem 0;font-family:'DM Mono',monospace;font-size:0.75rem;"
div.innerHTML = stats.map(([label, val]) =>
`<div style="border:1px solid #D8D3C8;border-radius:4px;padding:0.75rem;background:rgba(255,255,255,0.2);">
<div style="color:#AAAAAA;font-size:0.65rem;letter-spacing:0.06em;text-transform:uppercase;margin-bottom:0.3rem;">${label}</div>
<div style="color:#2C2C2C;font-size:0.85rem;">${val}</div>
</div>`
).join("")
return div
}
```
Auto leases are structurally similar to auto loans but with a critical addition: residual value.
At the end of a lease, the lessee either returns the car or buys it at the `contractResidualValue`.
The gap between `vehicleValueAmount` and `contractResidualValue` is what the trust is betting on —
if used car prices collapse, so does the residual.
That bet is in the data. `baseResidualValue` is the unsubsidized estimate;
`contractResidualValue` is what was promised to the lessee. The spread between them is the cost of
any manufacturer subvention. Filed monthly, for every lease in every pool.
```{ojs}
//| echo: false
{
const accentColor = "#5A9A82"
return html`<div style="
border-left: 3px solid ${accentColor};
padding: 0.1rem 0 0.1rem 1.25rem;
margin: 1.5rem 0;
">
<div style="font-family:'DM Mono',monospace;font-size:0.68rem;letter-spacing:0.06em;text-transform:uppercase;color:#AAAAAA;margin-bottom:0.75rem;">Auto Lease — sample record, 2026-05</div>
<table style="border-collapse:collapse;width:100%;max-width:520px;">
${autolease_sample.map((row, i) => html`<tr style="border-bottom:1px solid #E8E3D8;">
<td style="font-family:'DM Mono',monospace;font-size:0.75rem;color:#8A8A8A;padding:0.35rem 1rem 0.35rem 0;white-space:nowrap;vertical-align:top;">${row.field}</td>
<td style="font-family:'DM Mono',monospace;font-size:0.78rem;color:#2C2C2C;padding:0.35rem 0;">${row.value}</td>
</tr>`)}
</table>
</div>`
}
```
The World Omni/Toyota relationship is interesting here. Acquisition cost ($51,639) is below vehicle
value ($54,404) — the trust paid less than the car is worth, or the value estimate is stale.
The residual gap is $3,212 — World Omni is paying $3k to lower the lessee's end-of-lease buy price.
That's a subsidy to move Toyotas off lots, and it shows up as a direct cost in the trust.
**Key fields:** `lesseeCreditScore`, `lesseeCreditScoreType`, `vehicleManufacturerName`,
`vehicleValueAmount`, `acquisitionCost`, `contractResidualValue`, `baseResidualValue`,
`originalLeaseTermNumber`, `scheduledTerminationDate`, `currentDelinquencyStatus`,
`zeroBalanceCode`, `reportingPeriodSecuritizationValueAmount`
---
## CMBS
```{ojs}
//| echo: false
{
const stats = [
["Coverage", "Nov 2016 – May 2026"],
["Reporting periods", "115"],
["Peak trust count", "213 trusts"],
["Active loans (Apr 2026)", "~15K"],
["Parquet files", "11,973"],
["Fields per loan", "106"],
]
const div = document.createElement("div")
div.style.cssText = "display:grid;grid-template-columns:1fr 1fr 1fr;gap:1rem;margin:1.5rem 0;font-family:'DM Mono',monospace;font-size:0.75rem;"
div.innerHTML = stats.map(([label, val]) =>
`<div style="border:1px solid #D8D3C8;border-radius:4px;padding:0.75rem;background:rgba(255,255,255,0.2);">
<div style="color:#AAAAAA;font-size:0.65rem;letter-spacing:0.06em;text-transform:uppercase;margin-bottom:0.3rem;">${label}</div>
<div style="color:#2C2C2C;font-size:0.85rem;">${val}</div>
</div>`
).join("")
return div
}
```
CMBS is a different world. No borrower FICO, no vehicle. The collateral is commercial real estate
— office buildings, hotels, warehouses — and the schema reflects it. Structural complexity
dominates: interest-only periods, balloon maturities, prepayment premiums, workout strategies,
ARM rate reset mechanics. 106 fields per loan because commercial lending is complicated by design.
The loan count is two orders of magnitude smaller than auto (15k vs 5.4M) but the individual
loans are much larger. That Deutsche Bank sample is $85M on a single note backed by 68 properties.
Those properties have since become 51 — that attrition is filed monthly too.
```{ojs}
//| echo: false
{
const accentColor = "#C89050"
return html`<div style="
border-left: 3px solid ${accentColor};
padding: 0.1rem 0 0.1rem 1.25rem;
margin: 1.5rem 0;
">
<div style="font-family:'DM Mono',monospace;font-size:0.68rem;letter-spacing:0.06em;text-transform:uppercase;color:#AAAAAA;margin-bottom:0.75rem;">CMBS — sample record, 2026-04</div>
<table style="border-collapse:collapse;width:100%;max-width:520px;">
${cmbs_sample.map((row, i) => html`<tr style="border-bottom:1px solid #E8E3D8;">
<td style="font-family:'DM Mono',monospace;font-size:0.75rem;color:#8A8A8A;padding:0.35rem 1rem 0.35rem 0;white-space:nowrap;vertical-align:top;">${row.field}</td>
<td style="font-family:'DM Mono',monospace;font-size:0.78rem;color:#2C2C2C;padding:0.35rem 0;">${row.value}</td>
</tr>`)}
</table>
</div>`
}
```
3.55%, full-term interest-only, balloon maturity in October 2026. That balloon is a few months
away. Either the borrower refinances, the trust extends, or this loan becomes a workout situation —
and if the latter, there's a `workoutStrategyCode` field that will tell you exactly what strategy
the special servicer is pursuing.
**Key fields:** `originalLoanAmount`, `originalInterestRatePercentage`, `originalTermLoanNumber`,
`maturityDate`, `interestOnlyIndicator`, `balloonIndicator`, `loanStructureCode`,
`paymentStatusLoanCode`, `workoutStrategyCode`, `nonRecoverabilityIndicator`,
`NumberPropertiesSecuritization`, `NumberProperties`, `realizedLossToTrustAmount`
---
## Schema comparison
The three asset classes share a pipeline skeleton but diverge sharply in what they measure.
Auto loan and lease are borrower-centric — credit score, income, the car. CMBS drops borrower
credit entirely and adds structural complexity that consumer lending doesn't have: ARM mechanics,
workout codes, distress tracking, property counts.
```{ojs}
//| echo: false
{
const assetOrder = ["Auto Loan", "Auto Lease", "CMBS"]
const catOrder = categories.slice().reverse()
const maxFields = Math.max(...schema_matrix.map(d => d.fields))
const chart = Plot.plot({
marginLeft: 165,
marginBottom: 50,
marginTop: 30,
height: 310,
color: {
scheme: "blues",
domain: [0, maxFields],
unknown: "transparent",
},
x: {
domain: assetOrder,
label: null,
axis: "top",
},
y: {
label: null,
domain: catOrder,
},
style: { background: "transparent", fontSize: "12px" },
marks: [
Plot.cell(schema_matrix, {
x: "asset",
y: "category",
fill: d => d.fields > 0 ? d.fields : null,
rx: 3,
}),
Plot.text(schema_matrix, {
x: "asset",
y: "category",
text: d => d.fields > 0 ? String(d.fields) : "—",
fill: d => d.fields >= 10 ? "white" : (d.fields === 0 ? "#CCCCCC" : "#2C2C2C"),
fontSize: 11,
fontFamily: "'DM Mono', monospace",
}),
]
})
return html`<div style="margin:1.5rem 0">${chart}</div>
<div class="chart-caption">Approximate field counts by category. Pipeline metadata fields are identical across all three asset classes.</div>`
}
```
---
## Interested in the data?
The ingestion pipeline that built this lake — EDGAR scraping at scale, XML schema normalization
across trust vintages, Parquet conversion, partition management — lives in a private repository.
The effort to get here was non-trivial. If you're a researcher, institution, or potential
collaborator interested in the data or the pipeline, reach out:
[[email protected]](mailto:[email protected])