How it works Technical overview

From DATASUS FTP to your SQL prompt.

Fifteen stages, every transformation justified, with before-and-after examples. Read this once and you'll know exactly what the pipeline changed about the raw data — and what it didn't touch.

01

Origin: the DATASUS public FTP

Brazilian public-health microdata is published by the Ministry of Health on a public, anonymous FTP server: ftp.datasus.gov.br. Each subsystem (SIHSUS = hospital admissions, SIM = mortality) keeps its monthly or yearly files in a fixed directory. The pipeline does not scrape — it lists the directory, picks files matching a pattern, and downloads only what falls inside the user-specified date range.

note No credentials. The data is anonymised at source — you cannot identify a patient.
02

Filename parsing

Filenames encode UF (state) and period. SIHSUS uses RDUFYYMM.dbc — for example RDSP2401.dbc means São Paulo, January 2024. SIM is yearly and bifurcated by ICD revision: DOSP2023.dbc is CID-10 (1996+, 8-char stem); DORSP80.dbc is CID-9 (1979–1995, 7-char stem). The parser disambiguates by stem length, not by prefix — checking "starts with DOR" first would silently drop CID-10 deaths from RJ, RN, RO, RR, and RS.
03

Decompression: DBC → DBF

DBC is a DATASUS-proprietary compression layered over the venerable DBF table format. Historically you needed Windows-only tools (TabWin) to decompress it. The pipeline uses the pure-Python datasus_dbc library so the whole chain is cross-platform.

note The decompressed DBF file is a temporary intermediate. It is not kept on disk after the next stage.
04

Loading: DBF → DuckDB staging

Each DBF is streamed into a fresh in-memory DuckDB connection as a staging table. Two columns are added at this point: source_file (for audit trails) and uf (extracted from the filename). String columns are stripped of invisible junk (tabs, newlines, NULs) and trimmed.
05

Cleaning: invisible characters and empty strings

Raw DBF rows often contain whitespace and control characters that look fine in a text editor but break joins and uniqueness checks downstream.

Before After

"SP\t\n\r" "SP"
"" (empty string) NULL
" 04 " "04"
06

Type conversions: TRY_CAST

Every numeric and date column goes through TRY_CAST so a single malformed row doesn't abort the whole import — invalid values become NULL and a warning is logged. Integer columns include age, length-of-stay (days), counts. Float columns include monetary fields like VAL_TOT.
07

Date parsing — five formats, ordered

Dates in DATASUS files are not consistent. The parser tries five formats in order: YYYYMMDD, DDMMYYYY, DMMYYYY (a 7-digit format unique to SIM, where the day has no leading zero), YYYY-MM-DD, and as a last resort a generic cast.

Subsystem Raw Parsed

SIHSUS "20200131" 2020-01-31
SIM "1012023" 2023-01-01 (DMMYYYY)
Either "2020-01-31" 2020-01-31
08

Sex normalisation

Each subsystem encodes sex differently. The pipeline collapses both into a common M/F/I (indeterminate) code so cross-subsystem queries don't have to special-case each table.

Subsystem Raw Normalised

SIHSUS 0 / 1 / 3 I / M / F
SIM 1 / 2 M / F
SIM M / F (already textual) M / F
09

Race and colour (RACACOR)

RACACOR carries the same five IBGE categories in both subsystems but with different numeric codes. The pipeline maps them to the human-readable Portuguese label.

Subsystem Code Label

SIHSUS "01" "Branca"
SIHSUS "02" "Preta"
SIM "1" "Branca"
SIM "2" "Preta"
Both "03/3" "Parda"
10

Cause-of-death arrays (SIM)

SIM stores chains of CID-10 codes for each line of the death certificate (linhaa, linhab, linhac, linhad, causabas, linhaii) as star-delimited strings. The pipeline parses them into typed VARCHAR[] arrays so SQL can use UNNEST and array operators directly.

Raw Array

"*A01*J128" ['A01', 'J128']
"*B342" ['B342']
"" or "*" [] (empty)
11

Encoded age (SIM)

SIM packs the unit and the value of age into a single 3-digit field, where the leading digit is the unit (1=minutes, 2=hours, 3=months, 4=years, 5=>100 years). The pipeline decodes it into idade_valor (number) and idade_unidade (text).

Raw Decoded value Decoded unit

"403" 3 anos
"512" 12 >100 anos (i.e., 112)
"307" 7 meses
12

Geographic enrichment (IBGE join)

Both subsystems store a 7-digit IBGE municipality code (munic_res in SIHSUS, codmunres in SIM) but no human-readable city name. The pipeline LEFT JOINs against a built-in IBGE reference table to add the city name, the (already-known) state, and the macro-region.

Before After

munic_res = 3550308 municipio_res = "São Paulo", uf = "SP", regiao = "Sudeste"
munic_res = 3304557 municipio_res = "Rio de Janeiro", uf = "RJ", regiao = "Sudeste"
13

Storage: partitioned Parquet

The transformed table is written to {data_dir}/datasus_db/{subsystem}/uf={UF}/{filename}.parquet using Hive-style partitioning. Partitioning by UF means a query that filters on a single state reads only that state's files. ZSTD compression keeps the disk footprint small without hurting read speed.

note A _manifest.json tracks every DBC already processed, so re-running the pipeline on the same range is a no-op.
14

Query interface: DuckDB VIEW

The Parquet files are exposed as a single virtual table per subsystem (sihsus, sim). The view does a partition-aware glob (read_parquet('datasus_db/sihsus/uf=*/**.parquet')) so scans automatically prune partitions when the WHERE clause filters by UF or by year.
15

Custom queries — your turn

From the Query page you can write any SQL you like against the cleaned, enriched, partitioned data. Built-in column histograms (the # button next to each column) one-tap a GROUP BY query that includes the % population for each value, so distributions are visible without typing.

Subsystem	Raw	Parsed
SIHSUS	"20200131"	2020-01-31
SIM	"1012023"	2023-01-01 (DMMYYYY)
Either	"2020-01-31"	2020-01-31

Subsystem	Raw	Normalised
SIHSUS	0 / 1 / 3	I / M / F
SIM	1 / 2	M / F
SIM	M / F (already textual)	M / F

Subsystem	Code	Label
SIHSUS	"01"	"Branca"
SIHSUS	"02"	"Preta"
SIM	"1"	"Branca"
SIM	"2"	"Preta"
Both	"03/3"	"Parda"

Raw	Array
"A01J128"	['A01', 'J128']
"*B342"	['B342']
"" or "*"	[] (empty)

Before	After
munic_res = 3550308	municipio_res = "São Paulo", uf = "SP", regiao = "Sudeste"
munic_res = 3304557	municipio_res = "Rio de Janeiro", uf = "RJ", regiao = "Sudeste"

Before	After
"SP\t\n\r"	"SP"
"" (empty string)	NULL
" 04 "	"04"

Raw	Decoded value	Decoded unit
"403"	3	anos
"512"	12	>100 anos (i.e., 112)
"307"	7	meses

From DATASUS FTP to your SQL prompt.

Origin: the DATASUS public FTP

Filename parsing

Decompression: DBC → DBF

Loading: DBF → DuckDB staging

Cleaning: invisible characters and empty strings

Type conversions: TRY_CAST

Date parsing — five formats, ordered

Sex normalisation

Race and colour (RACACOR)

Cause-of-death arrays (SIM)

Encoded age (SIM)

Geographic enrichment (IBGE join)

Storage: partitioned Parquet

Query interface: DuckDB VIEW

Custom queries — your turn