datasus-etl
How it works Technical overview

From DATASUS FTP to your SQL prompt.

Fifteen stages, every transformation justified, with before-and-after examples. Read this once and you'll know exactly what the pipeline changed about the raw data — and what it didn't touch.

  1. 01

    Origin: the DATASUS public FTP

    Brazilian public-health microdata is published by the Ministry of Health on a public, anonymous FTP server: ftp.datasus.gov.br. Each subsystem (SIHSUS = hospital admissions, SIM = mortality) keeps its monthly or yearly files in a fixed directory. The pipeline does not scrape — it lists the directory, picks files matching a pattern, and downloads only what falls inside the user-specified date range.

  2. 02

    Filename parsing

    Filenames encode UF (state) and period. SIHSUS uses RDUFYYMM.dbc — for example RDSP2401.dbc means São Paulo, January 2024. SIM is yearly and bifurcated by ICD revision: DOSP2023.dbc is CID-10 (1996+, 8-char stem); DORSP80.dbc is CID-9 (1979–1995, 7-char stem). The parser disambiguates by stem length, not by prefix — checking "starts with DOR" first would silently drop CID-10 deaths from RJ, RN, RO, RR, and RS.

  3. 03

    Decompression: DBC → DBF

    DBC is a DATASUS-proprietary compression layered over the venerable DBF table format. Historically you needed Windows-only tools (TabWin) to decompress it. The pipeline uses the pure-Python datasus_dbc library so the whole chain is cross-platform.

  4. 04

    Loading: DBF → DuckDB staging

    Each DBF is streamed into a fresh in-memory DuckDB connection as a staging table. Two columns are added at this point: source_file (for audit trails) and uf (extracted from the filename). String columns are stripped of invisible junk (tabs, newlines, NULs) and trimmed.

  5. 05

    Cleaning: invisible characters and empty strings

    Raw DBF rows often contain whitespace and control characters that look fine in a text editor but break joins and uniqueness checks downstream.

    BeforeAfter
    "SP\t\n\r""SP"
    "" (empty string)NULL
    " 04 ""04"
  6. 06

    Type conversions: TRY_CAST

    Every numeric and date column goes through TRY_CAST so a single malformed row doesn't abort the whole import — invalid values become NULL and a warning is logged. Integer columns include age, length-of-stay (days), counts. Float columns include monetary fields like VAL_TOT.

  7. 07

    Date parsing — five formats, ordered

    Dates in DATASUS files are not consistent. The parser tries five formats in order: YYYYMMDD, DDMMYYYY, DMMYYYY (a 7-digit format unique to SIM, where the day has no leading zero), YYYY-MM-DD, and as a last resort a generic cast.

    SubsystemRawParsed
    SIHSUS"20200131"2020-01-31
    SIM"1012023"2023-01-01 (DMMYYYY)
    Either"2020-01-31"2020-01-31
  8. 08

    Sex normalisation

    Each subsystem encodes sex differently. The pipeline collapses both into a common M/F/I (indeterminate) code so cross-subsystem queries don't have to special-case each table.

    SubsystemRawNormalised
    SIHSUS0 / 1 / 3I / M / F
    SIM1 / 2M / F
    SIMM / F (already textual)M / F
  9. 09

    Race and colour (RACACOR)

    RACACOR carries the same five IBGE categories in both subsystems but with different numeric codes. The pipeline maps them to the human-readable Portuguese label.

    SubsystemCodeLabel
    SIHSUS"01""Branca"
    SIHSUS"02""Preta"
    SIM"1""Branca"
    SIM"2""Preta"
    Both"03/3""Parda"
  10. 10

    Cause-of-death arrays (SIM)

    SIM stores chains of CID-10 codes for each line of the death certificate (linhaa, linhab, linhac, linhad, causabas, linhaii) as star-delimited strings. The pipeline parses them into typed VARCHAR[] arrays so SQL can use UNNEST and array operators directly.

    RawArray
    "*A01*J128"['A01', 'J128']
    "*B342"['B342']
    "" or "*"[] (empty)
  11. 11

    Encoded age (SIM)

    SIM packs the unit and the value of age into a single 3-digit field, where the leading digit is the unit (1=minutes, 2=hours, 3=months, 4=years, 5=>100 years). The pipeline decodes it into idade_valor (number) and idade_unidade (text).

    RawDecoded valueDecoded unit
    "403"3anos
    "512"12>100 anos (i.e., 112)
    "307"7meses
  12. 12

    Geographic enrichment (IBGE join)

    Both subsystems store a 7-digit IBGE municipality code (munic_res in SIHSUS, codmunres in SIM) but no human-readable city name. The pipeline LEFT JOINs against a built-in IBGE reference table to add the city name, the (already-known) state, and the macro-region.

    BeforeAfter
    munic_res = 3550308municipio_res = "São Paulo", uf = "SP", regiao = "Sudeste"
    munic_res = 3304557municipio_res = "Rio de Janeiro", uf = "RJ", regiao = "Sudeste"
  13. 13

    Storage: partitioned Parquet

    The transformed table is written to {data_dir}/datasus_db/{subsystem}/uf={UF}/{filename}.parquet using Hive-style partitioning. Partitioning by UF means a query that filters on a single state reads only that state's files. ZSTD compression keeps the disk footprint small without hurting read speed.

  14. 14

    Query interface: DuckDB VIEW

    The Parquet files are exposed as a single virtual table per subsystem (sihsus, sim). The view does a partition-aware glob (read_parquet('datasus_db/sihsus/uf=*/**.parquet')) so scans automatically prune partitions when the WHERE clause filters by UF or by year.

  15. 15

    Custom queries — your turn

    From the Query page you can write any SQL you like against the cleaned, enriched, partitioned data. Built-in column histograms (the # button next to each column) one-tap a GROUP BY query that includes the % population for each value, so distributions are visible without typing.