Back to writing

Architecting a fiscal-data platform that processes 64 million companies a month

Momento Fiscal is a SaaS platform that helps tax and legal teams find who owes what. Before it existed, that work was manual and decentralized: an analyst would check the federal active-debt registry, then Serasa, then SPC, then a handful of court systems, copy numbers into spreadsheets, and try to stitch a picture together by hand. There was no single place to track debtors, identify tax liabilities, or build a recovery proposal.

The hard part isn’t the screens. It’s that being useful means consolidating data on 64 million Brazilian companies (CNPJs) every month, from sources that were never designed to be queried together.

The shape of the system

The core is a Ruby on Rails application: PostgreSQL holds the application’s own data, MongoDB holds the 64-million-company dataset, and Redis and Sidekiq do the heavy lifting.

  • Ingestion runs as batched, asynchronous Sidekiq jobs that pull from SERPRO, Receita Federal, SPC, Serasa, federal and state courts, and the active-debt registry.
  • Normalization reconciles the same company described five different ways into one MongoDB record, keyed by CNPJ and grouped geographically.
  • Serving keeps the read path fast: enriched data is precomputed, so the app answers questions instead of computing them on request.

A worked example: the Receita Federal dump

The clearest illustration of the whole pipeline is the one source I can show end to end: the Receita Federal open CNPJ dump. Every month the federal tax authority publishes the entire national company registry as a set of ZIP files. There’s no API, just a directory listing at a predictable URL (/CNPJ/dados_abertos_cnpj/YYYY-MM/), which the job scrapes with Nokogiri to discover that month’s files.

Those files are big, so the job doesn’t download them serially. It issues a HEAD to read the content length, splits the file into 15 MB pieces, and pulls them in parallel with HTTP Range requests over a persistent connection pool (one thread per piece), then reassembles them on disk. A single file becomes dozens of small, retryable downloads instead of one fragile multi-gigabyte stream.

Then the data fights back. The CSVs are Latin-1 encoded and their quoting is broken in ways a standard parser won’t survive: stray escaped quotes, fields delimited by ";" that confuse the quote character itself. So before parsing, the job re-encodes each line to ISO-8859-1 (dropping invalid bytes) and rewrites the broken quote sequences to a placeholder character (§) that the CSV reader can treat as the real quote. This is the unglamorous 80% of data engineering: the format is technically CSV and practically hostile.

One company is spread across four files, all joined on the 8-digit base_cnpj: establishments, companies, Simples (tax regime), and partners. The full CNPJ is base (8) + order (4) + check digits (2), formatted as XX.XXX.XXX/XXXX-XX. The job processes them in order: the establishments pass creates each record (an upsert keyed on the full CNPJ), and the companies and Simples passes enrich those existing records by base_cnpj: corporate name, legal nature, share capital, company size, tax regime. Partners land in their own collection.

Each file is streamed, not loaded: SmarterCSV reads it in chunks of 10–50k rows, and every chunk is serialized and written to MongoDB in a single bulk_write. MongoDB, not the Postgres that backs the app, because this dataset is write-heavy, batch-shaped, and 64 million records deep; bulk upserts are exactly what it’s good at.

Designing for 64 million a month

At that volume, the interesting constraints are operational, not algorithmic:

  • Idempotency. Jobs fail and retry. Every write is an upsert keyed on the CNPJ, so re-running a job converges to the same record instead of duplicating it. And the pipeline is split into independent passes (establishments, companies, Simples, partners) with skip flags, so a run that dies halfway resumes at the file that broke instead of starting over.
  • Backpressure. External sources rate-limit and go down. Sidekiq queues, retries with exponential backoff, and isolation between sources kept one slow provider from stalling the rest.
  • A real-time alert layer built on Redis surfaced changes (a new debt, a status change) at ~99% availability and a p95 of ~200ms, so the platform felt live even though ingestion was a monthly batch.

What I took from it

Processing 64 million records a month sounds like a data problem, and partly it is. But most of the engineering went into making the system boring under failure: idempotent, observable, and resilient to sources misbehaving. The metric I’m proudest of isn’t the 64 million: it’s that the alert path stayed at 99% while everything underneath it was messy, slow, and outside our control. That’s the part of architecture you can’t see in a demo, and it’s the part that decides whether a data product survives contact with production.