Modeling Human Behavior Across Scales: From Genes to Population Patterns

If you have tried to build a longitudinal behavioral dataset, you already know the pain. Relational databases struggle to represent the temporal granularity. Data lakes keep the raw data but lose the semantic relationships between records. Graph databases handle connections but choke on high-frequency time series. Every storage paradigm solves part of the problem and fails at the rest.

MindCODE's Data Cloud was designed from the ground up for one specific class of data: longitudinal human behavior across multiple biological and environmental scales. This post describes the architecture, the data model, and how it works in practice.

Why Behavior Data Breaks Standard Approaches

Consider what a comprehensive behavioral record actually contains for a single patient over one year of treatment for major depressive disorder:

52 weekly PHQ-9 assessments (structured, ordinal scale, 9 items each)
365 days of actigraphy at 30-second epochs (approximately 1 million time-series data points)
4 fMRI scans with resting-state connectivity matrices (approximately 35,000 region-pair correlations per scan)
12 monthly blood draws with inflammatory markers (IL-6, TNF-alpha, CRP) and BDNF levels
Genomic data: a fixed genotype with ~500,000 SNP positions relevant to psychiatric pharmacogenomics
52 weekly clinical notes averaging 400 words each (unstructured text)
Environmental context: ZIP code-level socioeconomic data, seasonal daylight hours, local air quality indices

Now try putting this in a single Postgres database. The PHQ-9 scores fit neatly in a table. The actigraphy data needs a time-series store. The fMRI connectivity matrices are dense numerical arrays. The genomic data is sparse and mostly static. The clinical notes need a text index. The environmental data is geospatial and temporal.

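A relational schema could hold all of this, but every cross-modality query would have to join tables of very different shapes and sizes (roughly a million actigraphy rows against 52 PHQ-9 rows for a single patient), and query performance would degrade quickly. A data lake (Parquet files on S3) would preserve the raw data but lose the relationships between modalities. You could build a custom pipeline for each analysis, but every new research question would require a new pipeline.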
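
As a rough check on the sizes quoted for the one-year record above, the counts can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the parcellation size of 264 regions is an assumption chosen because it yields about 35,000 region pairs, and the post does not say which atlas is used.

```python
# Back-of-envelope counts for the one-year record described above.
SECONDS_PER_DAY = 24 * 60 * 60

# Actigraphy: one data point per 30-second epoch, for 365 days.
epochs_per_day = SECONDS_PER_DAY // 30
actigraphy_points = 365 * epochs_per_day

# fMRI: a symmetric connectivity matrix over n regions has n*(n-1)/2 unique pairs.
# ASSUMPTION: 264 regions; the actual parcellation is not stated in the post.
n_regions = 264
region_pairs = n_regions * (n_regions - 1) // 2

# PHQ-9: 52 weekly assessments with 9 items each.
phq9_item_responses = 52 * 9

# Clinical notes: 52 notes averaging 400 words each.
note_words = 52 * 400

print(actigraphy_points, region_pairs, phq9_item_responses, note_words)
```

The spread is the point: about a million actigraphy values sit next to a few hundred questionnaire responses in the same patient record.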

The Six Data Domains

MindCODE organizes behavioral data into six domains, each with its own storage strategy, temporal characteristics, and access patterns:

Domain 1: Genetics and Epigenetics

Data types: SNP arrays, whole-genome sequencing, DNA methylation arrays (Illumina EPIC 850K), gene expression profiles (RNA-seq)
Temporal characteristics: Genotype is static. Epigenetic markers change over weeks to months. Gene expression changes over hours to days.
Storage: Variant Call Format (VCF) files indexed by genomic coordinate, methylation beta values in columnar store, expression matrices in HDF5
Volume: ~5GB per patient for WGS, ~50MB for SNP array, ~200MB for methylation

Domain 2: Cells and Tissue

Data types: Blood biomarkers (inflammatory cytokines, neurotrophic factors, hormones), cerebrospinal fluid markers, peripheral immune cell phenotyping
Temporal characteristics: Sampled at clinical visits, typically weekly to monthly
Storage: Structured time-series with lab reference ranges and assay metadata
Volume: Small per sample (~1KB), but longitudinal accumulation matters

Domain 3: Brain Imaging

Data types: Structural MRI (T1-weighted volumetrics), functional MRI (resting-state and task-based connectivity), diffusion tensor imaging (white matter tracts), EEG (spectral power, event-related potentials), MEG
Temporal characteristics: Scans at baseline, mid-treatment, and follow-up. EEG may be continuous for days.
Storage: NIfTI volumes in object store, derived connectivity matrices in array store, EEG in EDF+ format with epoch-level feature extraction
Volume: ~500MB per fMRI session (raw), ~2MB per connectivity matrix (derived)

Domain 4: Digital Phenotype

Data types: Accelerometry, GPS traces, phone usage patterns, sleep staging, heart rate variability, skin conductance, keystroke dynamics, voice prosody features
Temporal characteristics: Continuous or near-continuous (seconds to minutes resolution)
Storage: Time-series database (Apache IoTDB) with configurable downsampling, raw data in object store for reprocessing
Volume: ~50MB/day for a typical wearable sensor suite

Domain 5: Symptoms and Clinical Scales

Data types: Standardized instruments (PHQ-9, GAD-7, MADRS, HAMD-17, C-SSRS, PCL-5), ecological momentary assessments (EMA), clinician-rated scales, patient-reported outcomes
Temporal characteristics: Weekly to monthly for clinic-administered scales, multiple daily for EMA
Storage: Structured tables with instrument metadata, item-level responses, and computed subscale scores
Volume: Small per assessment, but high frequency for EMA

Domain 6: Environment and Population

Data types: Socioeconomic indicators (ADI, SVI), air quality (PM2.5, O3), daylight hours, temperature, neighborhood walkability scores, social determinants of health (SDOH), population-level prevalence data
Temporal characteristics: Updated daily to annually depending on source
Storage: Geospatial-temporal store linked to patient location history
Volume: Shared across patients in the same geographic area

The Canonical Longitudinal Record

The core abstraction in MindCODE's Data Cloud is the Canonical Longitudinal Record (CLR) — a unified, time-indexed representation of a single patient's data across all six domains.

The CLR is not a single table or document. It is a virtual entity — a set of pointers and access methods that present a coherent longitudinal view while allowing each domain to use its optimal storage backend.

CLR: Patient P-20481
├── timeline
│   ├── 2025-06-01 ... 2026-03-15 (active observation window)
│   └── resolution: per-domain (seconds for actigraphy, weeks for scales)
├── genetics
│   ├── genotype: VCF ref → gs://mindcode-genomics/P-20481/wgs.vcf.gz
│   ├── methylation[2025-06-01]: beta values → array-store/P-20481/methyl/baseline
│   └── methylation[2025-12-01]: beta values → array-store/P-20481/methyl/6month
├── biomarkers
│   ├── IL-6: [(2025-06-01, 4.2pg/mL), (2025-07-01, 3.8pg/mL), ...]
│   └── BDNF: [(2025-06-01, 18.4ng/mL), (2025-07-01, 21.1ng/mL), ...]
├── imaging
│   ├── fMRI[2025-06-01]: connectivity → array-store/P-20481/fmri/baseline
│   └── fMRI[2025-09-01]: connectivity → array-store/P-20481/fmri/3month
├── digital_phenotype
│   ├── actigraphy: ts-db/P-20481/accel (30s epochs)
│   ├── sleep: ts-db/P-20481/sleep (nightly summaries + epoch-level)
│   └── gps_mobility: ts-db/P-20481/gps (hourly radius of gyration)
├── scales
│   ├── PHQ-9: [(2025-06-01, 19), (2025-06-08, 18), ..., (2026-03-15, 9)]
│   └── GAD-7: [(2025-06-01, 14), (2025-06-08, 13), ...]
└── environment
    ├── location: ZIP 02138 (Cambridge, MA)
    ├── ADI_national_rank: 22
    └── daylight_hours: time-series linked to location
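
One way to picture the "set of pointers" idea is a small record type that stores only references into each backend. The sketch below is illustrative and is not MindCODE's actual schema: the class names, field names, and the "object-store" backend label are assumptions, while the URIs are copied from the tree above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DomainRef:
    """Pointer to data held in a domain-specific backend (hypothetical type)."""
    domain: str                    # e.g. "imaging"
    name: str                      # e.g. "fMRI"
    backend: str                   # e.g. "array-store"
    uri: str                       # where the data actually lives
    as_of: Optional[date] = None   # None for static or continuous data


@dataclass
class CanonicalLongitudinalRecord:
    """Virtual record: holds pointers, never the underlying data."""
    patient_id: str
    window_start: date
    window_end: date
    refs: list = field(default_factory=list)

    def in_domain(self, domain):
        """Return all pointers that belong to one domain."""
        return [r for r in self.refs if r.domain == domain]

    def backends(self):
        """Return the set of storage backends this record spans."""
        return {r.backend for r in self.refs}


clr = CanonicalLongitudinalRecord(
    "P-20481", date(2025, 6, 1), date(2026, 3, 15),
    refs=[
        DomainRef("genetics", "genotype", "object-store",
                  "gs://mindcode-genomics/P-20481/wgs.vcf.gz"),
        DomainRef("imaging", "fMRI", "array-store",
                  "array-store/P-20481/fmri/baseline", date(2025, 6, 1)),
        DomainRef("imaging", "fMRI", "array-store",
                  "array-store/P-20481/fmri/3month", date(2025, 9, 1)),
        DomainRef("digital_phenotype", "actigraphy", "ts-db",
                  "ts-db/P-20481/accel"),
    ],
)
print(len(clr.in_domain("imaging")), sorted(clr.backends()))
```

Keeping the record as pointers means a domain can change its storage backend without changing the shape of the record itself.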

Cross-Domain Queries

The CLR enables queries that span domains without custom joins:

-- Find patients where PHQ-9 improved >50% AND
-- default mode network connectivity decreased AND
-- sleep efficiency improved >15 percentage points
SELECT patient_id
FROM canonical_longitudinal_record
WHERE scales.PHQ9.percent_change(baseline, week_12) < -50
  AND imaging.fMRI.dmn_connectivity.change(baseline, month_3) < 0
  AND digital_phenotype.sleep.efficiency.change(baseline, week_12) > 15

This query touches three different storage backends (structured tables, array store, time-series database) but presents as a single logical operation to the researcher.
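
Conceptually, a query like this can be answered by evaluating each condition against its own backend and intersecting the resulting patient sets. The sketch below shows only that fan-out-and-intersect idea; it is not how MindCODE's planner is implemented. The three dictionaries are made-up stand-ins for backend calls, and the function names are hypothetical.

```python
# Hypothetical per-backend summaries keyed by patient ID. In a real system
# each dictionary would be the result of a call into that storage backend.
phq9_percent_change = {"P-1": -58.0, "P-2": -20.0, "P-3": -61.0}     # structured tables
dmn_connectivity_change = {"P-1": -0.08, "P-2": -0.03, "P-3": 0.05}  # array store
sleep_efficiency_change = {"P-1": 17.0, "P-2": 18.0, "P-3": 21.0}    # time-series database


def matching(values, predicate):
    """Return the IDs of patients whose value satisfies the predicate."""
    return {pid for pid, value in values.items() if predicate(value)}


def cross_domain_query():
    """Evaluate each condition separately, then intersect the patient sets."""
    per_backend = [
        matching(phq9_percent_change, lambda v: v < -50),   # PHQ-9 improved >50%
        matching(dmn_connectivity_change, lambda v: v < 0), # DMN connectivity decreased
        matching(sleep_efficiency_change, lambda v: v > 15),# sleep efficiency up >15 points
    ]
    return set.intersection(*per_backend)


print(sorted(cross_domain_query()))
```

With these illustrative values only P-1 satisfies all three conditions: P-2 misses the PHQ-9 threshold and P-3 shows increased rather than decreased connectivity.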

The Feature Factory

Raw longitudinal data is not directly useful for most modeling tasks. You need derived features: slopes, variability measures, change points, cross-domain correlations. Computing these features from scratch for every analysis is wasteful and error-prone.

MindCODE's Feature Factory is a governed computation layer that transforms raw CLR data into reusable behavioral features:

Trajectory features: PHQ-9 slope over 4 weeks, sleep efficiency coefficient of variation, step count trend
Cross-domain features: correlation between daily mood EMA and next-night sleep efficiency, lagged relationship between IL-6 levels and PHQ-9 score changes
Derived biomarkers: heart rate variability (RMSSD) computed from raw RR intervals, circadian rhythm stability index from actigraphy
Population-relative features: patient's PHQ-9 trajectory compared to the cohort mean trajectory, percentile rank for treatment response speed
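
A few of the features listed above have simple textbook definitions. The sketch below uses only the Python standard library; the function names and sample values are illustrative and are not taken from the Feature Factory itself.

```python
import math
from statistics import mean, stdev


def slope_per_week(scores):
    """Ordinary least-squares slope for equally spaced weekly scores."""
    weeks = range(len(scores))
    w_mean, s_mean = mean(weeks), mean(scores)
    num = sum((w - w_mean) * (s - s_mean) for w, s in zip(weeks, scores))
    den = sum((w - w_mean) ** 2 for w in weeks)
    return num / den


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Illustrative inputs: four weekly PHQ-9 totals, three nightly sleep
# efficiency percentages, and four RR intervals in milliseconds.
print(slope_per_week([19, 18, 17, 16]))
print(coefficient_of_variation([80, 90, 100]))
print(rmssd([800, 810, 790, 800]))
```

A slope of -1.0 here means the PHQ-9 total fell by one point per week across the four-week window.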

Each computed feature carries full provenance: the source data versions, the computation code version, the parameters used, and the consent status of the input data. If a patient withdraws consent for wearable data, every feature derived from that data is flagged and excluded from future analyses.
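
A provenance record of this kind could look roughly like the following. It is a minimal sketch under stated assumptions: the field names, version strings, and the `usable_features` helper are hypothetical, and consent is reduced to a per-patient set of withdrawn domains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """A computed feature together with its provenance (hypothetical layout)."""
    patient_id: str
    name: str
    value: float
    source_domains: frozenset  # domains the feature was computed from
    source_versions: tuple     # versions of the input data
    code_version: str          # version of the computation code
    params: tuple              # (key, value) pairs used in the computation


def usable_features(features, withdrawn):
    """Drop features whose inputs include a domain the patient has withdrawn.

    `withdrawn` maps patient_id to the set of domains with withdrawn consent.
    """
    return [
        f for f in features
        if not (f.source_domains & withdrawn.get(f.patient_id, set()))
    ]


features = [
    FeatureRecord("P-1", "phq9_slope_4w", -1.0, frozenset({"scales"}),
                  ("phq9@v1",), "ff-1.0", (("weeks", 4),)),
    FeatureRecord("P-1", "sleep_efficiency_cv", 0.11,
                  frozenset({"digital_phenotype"}),
                  ("sleep@v3",), "ff-1.0", ()),
]

# P-1 withdraws consent for wearable (digital phenotype) data.
kept = usable_features(features, {"P-1": {"digital_phenotype"}})
print([f.name for f in kept])
```

Because each feature records the domains it was derived from, a withdrawal can be applied by filtering on provenance rather than by recomputing anything.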

Feature Versioning

Features are versioned independently of source data. When the sleep staging algorithm improves, new feature versions are computed in parallel while old versions remain available for reproducibility. A published study can always reference the exact feature version it used.
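
The idea of keeping old and new feature versions side by side can be sketched as a registry keyed by patient, feature name, and version. The class below is an illustration only: integer version numbers, the method names, and the sample values are assumptions, not MindCODE's API.

```python
class FeatureRegistry:
    """Stores every version of a feature value; nothing is overwritten."""

    def __init__(self):
        self._values = {}  # (patient_id, feature, version) -> value
        self._latest = {}  # feature -> highest version registered so far

    def put(self, patient_id, feature, version, value):
        """Register a value under an explicit integer version."""
        self._values[(patient_id, feature, version)] = value
        self._latest[feature] = max(version, self._latest.get(feature, version))

    def get(self, patient_id, feature, version=None):
        """Return a pinned version, or the latest one if none is given."""
        if version is None:
            version = self._latest[feature]
        return self._values[(patient_id, feature, version)]


registry = FeatureRegistry()
registry.put("P-20481", "sleep_efficiency", 1, 84.0)  # original staging algorithm
registry.put("P-20481", "sleep_efficiency", 2, 86.5)  # improved staging algorithm

print(registry.get("P-20481", "sleep_efficiency"))             # latest version
print(registry.get("P-20481", "sleep_efficiency", version=1))  # pinned for a study
```

A published analysis would pin `version=1`, while new work picks up the latest version by default.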

Access Policy and Consent

Every CLR query passes through MindCODE's policy engine before execution. The policy engine evaluates:

Researcher identity and role: authenticated via institutional IdP
Dataset access policy: which data domains are available for this study protocol
Per-patient consent directives: which patients have consented to this specific use case
De-identification requirements: whether the approved use requires k-anonymity, differential privacy, or full de-identification
Data use agreement (DUA) constraints: some datasets have DUA terms that restrict export, linking, or commercial use

The policy evaluation result — approved, denied, or partially approved with restrictions — is logged alongside the query in the audit trail.
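
The three possible outcomes and the audit entry can be illustrated with a simplified evaluator. This sketch covers only identity, dataset policy, and per-patient consent; de-identification requirements and DUA constraints are omitted, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    PARTIAL = "partially approved with restrictions"
    DENIED = "denied"


@dataclass
class QueryRequest:
    researcher: str
    authenticated: bool       # result of the institutional IdP check
    protocol_domains: set     # domains the study protocol may access
    requested_domains: set    # domains the query touches
    requested_patients: set
    consented_patients: set   # patients whose consent covers this use


audit_log = []


def evaluate(req):
    """Return (decision, patients the query may see) and log the outcome."""
    allowed = req.authenticated and req.requested_domains <= req.protocol_domains
    patients = req.requested_patients & req.consented_patients if allowed else set()
    if not patients:
        decision = Decision.DENIED
    elif patients == req.requested_patients:
        decision = Decision.APPROVED
    else:
        decision = Decision.PARTIAL
    audit_log.append((req.researcher, decision.value, sorted(patients)))
    return decision, patients


request = QueryRequest(
    researcher="dr-lee",
    authenticated=True,
    protocol_domains={"scales", "digital_phenotype"},
    requested_domains={"scales"},
    requested_patients={"P-1", "P-2", "P-3"},
    consented_patients={"P-1", "P-3"},
)
decision, visible = evaluate(request)
print(decision.value, sorted(visible))
```

In this example the request is partially approved: the domains are within the protocol, but only two of the three requested patients have consented to the use.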

MindCODE supports consent directives at the domain level. A patient's directives might include:

Sharing scale scores and clinical notes with their treatment team (Domain 5 and clinical records)
Sharing de-identified scale scores and digital phenotype data for research (Domains 4 and 5, de-identified)
Excluding genomic data from all external sharing (Domain 1 restricted)
Sharing brain imaging with a specific multi-site research consortium (Domain 3, named recipient)

The CLR respects these directives at query time. A researcher querying a cross-domain feature sees results only for patients whose consent covers every domain involved in that feature.
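
The "covers every domain" rule amounts to a subset check over each patient's directives. In the sketch below, directives are modeled as (domain, purpose) pairs. That representation, the purpose labels, and the patient IDs are assumptions made for illustration, and named recipients and de-identification are left out.

```python
# Hypothetical directives: patient ID -> set of (domain, purpose) pairs granted.
directives = {
    "P-1": {("scales", "treatment"), ("scales", "research"),
            ("digital_phenotype", "research")},
    "P-2": {("scales", "treatment"), ("scales", "research")},
    "P-3": {("scales", "research"), ("digital_phenotype", "research"),
            ("imaging", "research")},
}


def patients_covering(feature_domains, purpose, directives):
    """Return patients whose directives grant every domain for this purpose."""
    needed = {(domain, purpose) for domain in feature_domains}
    return {pid for pid, granted in directives.items() if needed <= granted}


# A cross-domain feature built from scale scores and digital phenotype data.
feature_domains = {"scales", "digital_phenotype"}
print(sorted(patients_covering(feature_domains, "research", directives)))
```

P-2 is excluded because that patient has not granted research use of digital phenotype data, even though the scale scores alone would be available.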

Building on the Data Cloud

The Data Cloud and CLR are the foundation for everything else in MindCODE: the Virtual Human system reasons over CLRs, the Clinical Intelligence Suite computes treatment response predictions from CLR features, and the research tools enable cohort analyses across thousands of CLRs.

In the next post, we will explain how the Virtual Human system uses the CLR to build comprehensive patient models that support clinician decision-making — without replacing clinical judgment.