Modeling Human Behavior Across Scales: From Genes to Population Patterns
If you have tried to build a longitudinal behavioral dataset, you already know the pain. Relational databases struggle to represent the temporal granularity. Data lakes keep the raw data but lose the semantic relationships. Graph databases handle connections but choke on high-frequency time series. Every storage paradigm solves part of the problem and fails at the rest.
MindCODE's Data Cloud was designed from the ground up for one specific class of data: longitudinal human behavior across multiple biological and environmental scales. This post describes the architecture, the data model, and how it works in practice.
Why Behavior Data Breaks Standard Approaches
Consider what a comprehensive behavioral record actually contains for a single patient over one year of treatment for major depressive disorder:
- 52 weekly PHQ-9 assessments (structured, ordinal scale, 9 items each)
- 365 days of actigraphy at 30-second epochs (approximately 1 million time-series data points)
- 4 fMRI scans with resting-state connectivity matrices (approximately 35,000 region-pair correlations per scan)
- 12 monthly blood draws with inflammatory markers (IL-6, TNF-alpha, CRP) and BDNF levels
- Genomic data: a fixed genotype with ~500,000 SNP positions relevant to psychiatric pharmacogenomics
- 52 weekly clinical notes averaging 400 words each (unstructured text)
- Environmental context: ZIP code-level socioeconomic data, seasonal daylight hours, local air quality indices
Now try putting this in a single Postgres database. The PHQ-9 scores fit neatly in a table. The actigraphy data needs a time-series store. The fMRI connectivity matrices are dense numerical arrays. The genomic data is sparse and mostly static. The clinical notes need a text index. The environmental data is geospatial and temporal.
A relational database would need joins across tables that differ by orders of magnitude in row count and sampling rate, and query performance would degrade quickly. A data lake (Parquet files on S3) would preserve the raw data but lose the relationships between modalities. You could build a custom pipeline for each analysis, but every new research question would then require a new pipeline.
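To make the scale mismatch concrete, the per-patient counts quoted in the list above can be tallied in a few lines of Python. This is back-of-the-envelope arithmetic on the figures already given, not a measurement of any real dataset, and the dictionary keys are illustrative names.

```python
# Rough per-patient counts for one year, using the figures quoted in the list above.
SECONDS_PER_DAY = 24 * 60 * 60

record_counts = {
    "phq9_item_responses": 52 * 9,                      # weekly assessments x 9 items
    "actigraphy_epochs": 365 * SECONDS_PER_DAY // 30,   # one value per 30-second epoch
    "fmri_region_pairs": 4 * 35_000,                    # 4 scans x ~35,000 correlations
    "blood_draws": 12,                                  # monthly
    "snp_positions": 500_000,                           # static genotype
    "clinical_note_words": 52 * 400,                    # weekly notes x ~400 words
}

# Print from smallest to largest to show the spread across modalities.
for name, count in sorted(record_counts.items(), key=lambda kv: kv[1]):
    print(f"{name:>22}: {count:>9,}")
```

The smallest and largest entries differ by roughly five orders of magnitude, which is the gap a single storage engine would have to span.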
The Six Data Domains
MindCODE organizes behavioral data into six domains, each with its own storage strategy, temporal characteristics, and access patterns:
Domain 1: Genetics and Epigenetics
- Data types: SNP arrays, whole-genome sequencing, DNA methylation arrays (Illumina EPIC 850K), gene expression profiles (RNA-seq)
- Temporal characteristics: Genotype is static. Epigenetic markers change over weeks to months. Gene expression changes over hours to days.
- Storage: Variant call format (VCF) files indexed by genomic coordinate, methylation beta values in columnar store, expression matrices in HDF5
- Volume: ~5GB per patient for WGS, ~50MB for SNP array, ~200MB for methylation
Domain 2: Cells and Tissue
- Data types: Blood biomarkers (inflammatory cytokines, neurotrophic factors, hormones), cerebrospinal fluid markers, peripheral immune cell phenotyping
- Temporal characteristics: Sampled at clinical visits, typically weekly to monthly
- Storage: Structured time-series with lab reference ranges and assay metadata
- Volume: Small per sample (~1KB), but longitudinal accumulation matters
Domain 3: Brain Imaging
- Data types: Structural MRI (T1-weighted volumetrics), functional MRI (resting-state and task-based connectivity), diffusion tensor imaging (white matter tracts), EEG (spectral power, event-related potentials), MEG
- Temporal characteristics: Scans at baseline, mid-treatment, and follow-up. EEG may be continuous for days.
- Storage: NIfTI volumes in object store, derived connectivity matrices in array store, EEG in EDF+ format with epoch-level feature extraction
- Volume: ~500MB per fMRI session (raw), ~2MB per connectivity matrix (derived)
Domain 4: Digital Phenotype
- Data types: Accelerometry, GPS traces, phone usage patterns, sleep staging, heart rate variability, skin conductance, keystroke dynamics, voice prosody features
- Temporal characteristics: Continuous or near-continuous (seconds to minutes resolution)
- Storage: Time-series database (Apache IoTDB) with configurable downsampling, raw data in object store for reprocessing
- Volume: ~50MB/day for a typical wearable sensor suite
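The idea behind configurable downsampling can be illustrated independently of any particular database. The sketch below is plain Python, not Apache IoTDB's API: `downsample_mean` is a hypothetical helper that averages 30-second samples into fixed-width buckets, and the input values are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import fmean


def downsample_mean(samples, bucket_seconds):
    """Average (timestamp, value) pairs into fixed-width, epoch-aligned buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        epoch = int(ts.timestamp())
        # Floor each timestamp to the start of its bucket.
        buckets[epoch - epoch % bucket_seconds].append(value)
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), fmean(values))
        for start, values in sorted(buckets.items())
    ]


# Synthetic input: two hours of 30-second epochs (240 samples).
start = datetime(2025, 6, 1, tzinfo=timezone.utc)
raw = [
    (start + timedelta(seconds=30 * i), 1.0 if i < 120 else 3.0)
    for i in range(240)
]

hourly = downsample_mean(raw, bucket_seconds=3600)
for bucket_start, mean_value in hourly:
    print(bucket_start.isoformat(), mean_value)
```

Keeping the raw samples in an object store, as described above, means a different bucket width or aggregate can be recomputed later without losing information.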
Domain 5: Symptoms and Clinical Scales
- Data types: Standardized instruments (PHQ-9, GAD-7, MADRS, HAMD-17, C-SSRS, PCL-5), ecological momentary assessments (EMA), clinician-rated scales, patient-reported outcomes
- Temporal characteristics: Weekly to monthly for clinic-administered scales, multiple daily for EMA
- Storage: Structured tables with instrument metadata, item-level responses, and computed subscale scores
- Volume: Small per assessment, but high frequency for EMA
Domain 6: Environment and Population
- Data types: Socioeconomic indicators (ADI, SVI), air quality (PM2.5, O3), daylight hours, temperature, neighborhood walkability scores, social determinants of health (SDOH), population-level prevalence data
- Temporal characteristics: Updated daily to annually depending on source
- Storage: Geospatial-temporal store linked to patient location history
- Volume: Shared across patients in the same geographic area
The Canonical Longitudinal Record
The core abstraction in MindCODE's Data Cloud is the Canonical Longitudinal Record (CLR) — a unified, time-indexed representation of a single patient's data across all six domains.
The CLR is not a single table or document. It is a virtual entity — a set of pointers and access methods that present a coherent longitudinal view while allowing each domain to use its optimal storage backend.
CLR: Patient P-20481
├── timeline
│ ├── 2025-06-01 ... 2026-03-15 (active observation window)
│ └── resolution: per-domain (seconds for actigraphy, weeks for scales)
├── genetics
│ ├── genotype: VCF ref → gs://mindcode-genomics/P-20481/wgs.vcf.gz
│ ├── methylation[2025-06-01]: beta values → array-store/P-20481/methyl/baseline
│ └── methylation[2025-12-01]: beta values → array-store/P-20481/methyl/6month
├── biomarkers
│ ├── IL-6: [(2025-06-01, 4.2pg/mL), (2025-07-01, 3.8pg/mL), ...]
│ └── BDNF: [(2025-06-01, 18.4ng/mL), (2025-07-01, 21.1ng/mL), ...]
├── imaging
│ ├── fMRI[2025-06-01]: connectivity → array-store/P-20481/fmri/baseline
│ └── fMRI[2025-09-01]: connectivity → array-store/P-20481/fmri/3month
├── digital_phenotype
│ ├── actigraphy: ts-db/P-20481/accel (30s epochs)
│ ├── sleep: ts-db/P-20481/sleep (nightly summaries + epoch-level)
│ └── gps_mobility: ts-db/P-20481/gps (hourly radius of gyration)
├── scales
│ ├── PHQ-9: [(2025-06-01, 19), (2025-06-08, 18), ..., (2026-03-15, 9)]
│ └── GAD-7: [(2025-06-01, 14), (2025-06-08, 13), ...]
└── environment
  ├── location: ZIP 02138 (Cambridge, MA)
  ├── ADI_national_rank: 22
  └── daylight_hours: time-series linked to location
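One way to picture the "set of pointers and access methods" is a small record that stores references and defers loading to a per-backend function. The sketch below is an assumption about how such a structure could look, not MindCODE's implementation: `DomainRef`, `CanonicalLongitudinalRecord`, `resolve`, and the loader registry are all hypothetical names, and the loaders return placeholder strings instead of real data.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical loader type: takes a backend-specific path, returns the data.
Loader = Callable[[str], Any]


@dataclass(frozen=True)
class DomainRef:
    backend: str  # e.g. "array-store" or "ts-db"
    path: str     # location of the data inside that backend


@dataclass
class CanonicalLongitudinalRecord:
    patient_id: str
    refs: Dict[str, DomainRef] = field(default_factory=dict)

    def resolve(self, key: str, loaders: Dict[str, Loader]) -> Any:
        """Follow a pointer and fetch the data from whichever backend owns it."""
        ref = self.refs[key]
        return loaders[ref.backend](ref.path)


# Placeholder loaders standing in for real storage clients.
loaders: Dict[str, Loader] = {
    "array-store": lambda path: f"<matrix at {path}>",
    "ts-db": lambda path: f"<series at {path}>",
}

clr = CanonicalLongitudinalRecord(
    patient_id="P-20481",
    refs={
        "imaging.fmri.baseline": DomainRef("array-store", "P-20481/fmri/baseline"),
        "digital_phenotype.actigraphy": DomainRef("ts-db", "P-20481/accel"),
    },
)

print(clr.resolve("imaging.fmri.baseline", loaders))
```

The record itself stays small because it holds only references; the heavy data is read on demand from the backend best suited to it.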
Cross-Domain Queries
The CLR enables queries that span domains without custom joins:
-- Find patients where PHQ-9 improved >50% AND
-- default mode network connectivity decreased AND
-- sleep efficiency improved >15 percentage points
SELECT patient_id
FROM canonical_longitudinal_record
WHERE scales.PHQ9.percent_change(baseline, week_12) < -50
AND imaging.fMRI.dmn_connectivity.change(baseline, month_3) < 0
AND digital_phenotype.sleep.efficiency.change(baseline, week_12) > 15
This query touches three different storage backends (structured tables, array store, time-series database) but presents as a single logical operation to the researcher.
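A plausible way to execute such a query is to send each condition to its own backend and intersect the resulting patient sets. The sketch below assumes that strategy; it is not a description of MindCODE's query planner. The three dictionaries are toy stand-ins for the backends, with made-up values, and `matching` and `cross_domain_query` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Set

# Toy stand-ins for three backends; every value here is invented for illustration.
phq9_percent_change = {"P-1": -55.0, "P-2": -20.0, "P-3": -60.0}     # structured tables
dmn_connectivity_change = {"P-1": -0.08, "P-2": -0.02, "P-3": 0.05}  # array store
sleep_efficiency_change = {"P-1": 17.0, "P-2": 18.0, "P-3": 21.0}    # time-series database


def matching(values: Dict[str, float], predicate: Callable[[float], bool]) -> Set[str]:
    """Return the patient IDs whose value satisfies the predicate."""
    return {pid for pid, value in values.items() if predicate(value)}


def cross_domain_query(sub_queries: List[Callable[[], Set[str]]]) -> Set[str]:
    """Run each per-backend sub-query and keep patients present in all results."""
    results = [query() for query in sub_queries]
    return set.intersection(*results) if results else set()


cohort = cross_domain_query([
    lambda: matching(phq9_percent_change, lambda v: v < -50),
    lambda: matching(dmn_connectivity_change, lambda v: v < 0),
    lambda: matching(sleep_efficiency_change, lambda v: v > 15),
])
print(sorted(cohort))
```

Only the patient who satisfies all three conditions survives the intersection, which mirrors the `AND` semantics of the query above.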
The Feature Factory
Raw longitudinal data is not directly useful for most modeling tasks. You need derived features: slopes, variability measures, change points, cross-domain correlations. Computing these features from scratch for every analysis is wasteful and error-prone.
MindCODE's Feature Factory is a governed computation layer that transforms raw CLR data into reusable behavioral features:
- Trajectory features: PHQ-9 slope over 4 weeks, sleep efficiency coefficient of variation, step count trend
- Cross-domain features: correlation between daily mood EMA and next-night sleep efficiency, lagged relationship between IL-6 levels and PHQ-9 score changes
- Derived biomarkers: heart rate variability (RMSSD) computed from raw RR intervals, circadian rhythm stability index from actigraphy
- Population-relative features: patient's PHQ-9 trajectory compared to the cohort mean trajectory, percentile rank for treatment response speed
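Three of the features named above have standard definitions that fit in a few lines. The sketch below uses those textbook formulas (ordinary least-squares slope, sample standard deviation over mean, and root mean square of successive RR-interval differences); the function names and all input values are illustrative, and nothing here is claimed to match the Feature Factory's actual code.

```python
import math
from statistics import fmean, stdev


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / fmean(values)


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences between RR intervals."""
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(fmean([d * d for d in diffs]))


# Illustrative inputs only.
weeks = [0, 1, 2, 3]
phq9_scores = [19, 18, 16, 15]
sleep_efficiency = [80.0, 90.0, 100.0]
rr_ms = [800, 810, 790, 805]

print("PHQ-9 slope (points/week):", slope(weeks, phq9_scores))
print("Sleep efficiency CV:", coefficient_of_variation(sleep_efficiency))
print("RMSSD (ms):", rmssd(rr_ms))
```

Centralizing definitions like these is what makes a feature reusable: every analysis that asks for a four-week PHQ-9 slope gets the same computation.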
Each computed feature carries full provenance: the source data versions, the computation code version, the parameters used, and the consent status of the input data. If a patient withdraws consent for wearable data, every feature derived from that data is flagged and excluded from future analyses.
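A minimal sketch of what a provenance-carrying feature and a consent withdrawal could look like follows. The field names mirror the four provenance items listed above, but `FeatureRecord`, `withdraw_consent`, the version strings, and the feature values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class FeatureRecord:
    name: str
    patient_id: str
    value: float
    source_domains: FrozenSet[str]   # which domains fed this feature
    source_versions: Dict[str, str]  # version of each source dataset
    code_version: str                # version of the computation code
    params: Dict[str, object]        # parameters used in the computation
    excluded: bool = False           # set when input consent is withdrawn


def withdraw_consent(features: List[FeatureRecord], patient_id: str, domain: str) -> List[str]:
    """Flag every feature of this patient that depends on the withdrawn domain."""
    flagged = []
    for feature in features:
        if feature.patient_id == patient_id and domain in feature.source_domains:
            feature.excluded = True
            flagged.append(feature.name)
    return flagged


# Invented example features for one patient.
features = [
    FeatureRecord("phq9_slope_4w", "P-20481", -1.4, frozenset({"scales"}),
                  {"scales": "v3"}, "ff-1.2.0", {"window_weeks": 4}),
    FeatureRecord("sleep_efficiency_cv", "P-20481", 0.11, frozenset({"digital_phenotype"}),
                  {"digital_phenotype": "v7"}, "ff-1.2.0", {}),
    FeatureRecord("mood_sleep_corr", "P-20481", 0.42,
                  frozenset({"scales", "digital_phenotype"}),
                  {"scales": "v3", "digital_phenotype": "v7"}, "ff-1.2.0", {"lag_days": 1}),
]

print(withdraw_consent(features, "P-20481", "digital_phenotype"))
usable = [f.name for f in features if not f.excluded]
print(usable)
```

Note that the cross-domain feature is excluded too, because one of its two inputs is no longer covered by consent.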
Feature Versioning
Features are versioned independently of source data. When the sleep staging algorithm improves, new feature versions are computed in parallel while old versions remain available for reproducibility. A published study can always reference the exact feature version it used.
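The versioning behavior described here can be sketched as an append-only store keyed by feature name and version number. `FeatureStore`, its methods, and the example values are assumptions made for illustration; the real system's interface is not described in this post.

```python
from typing import Dict, Optional


class FeatureStore:
    """Append-only store: published versions are never overwritten."""

    def __init__(self) -> None:
        # feature name -> version number -> {patient_id: value}
        self._versions: Dict[str, Dict[int, Dict[str, float]]] = {}

    def publish(self, feature: str, version: int, values: Dict[str, float]) -> None:
        versions = self._versions.setdefault(feature, {})
        if version in versions:
            raise ValueError(f"{feature} v{version} is already published and immutable")
        versions[version] = dict(values)

    def get(self, feature: str, version: Optional[int] = None) -> Dict[str, float]:
        """Return a pinned version, or the latest one when no version is given."""
        versions = self._versions[feature]
        chosen = max(versions) if version is None else version
        return dict(versions[chosen])


store = FeatureStore()
store.publish("sleep_efficiency", 1, {"P-20481": 0.81})  # original staging algorithm
store.publish("sleep_efficiency", 2, {"P-20481": 0.84})  # improved staging algorithm

print(store.get("sleep_efficiency", version=1))  # what a published study pinned
print(store.get("sleep_efficiency"))             # what new analyses pick up
```

Refusing to overwrite a published version is the design choice that makes a cited feature version reproducible later.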
Governed Access and Consent-Aware Workflows
Every CLR query passes through MindCODE's policy engine before execution. The policy engine evaluates:
- Researcher identity and role: authenticated via institutional IdP
- Dataset access policy: which data domains are available for this study protocol
- Per-patient consent directives: which patients have consented to this specific use case
- De-identification requirements: whether the approved use requires k-anonymity, differential privacy, or full de-identification
- Data use agreement (DUA) constraints: some datasets have DUA terms that restrict export, linking, or commercial use
The policy evaluation result — approved, denied, or partially approved with restrictions — is logged alongside the query in the audit trail.
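The three-way outcome and the audit logging can be sketched as a single function. This is a deliberately reduced model under stated assumptions: it checks only authentication, protocol-level domain access, and one DUA export flag, leaving out per-patient consent and de-identification. `Decision`, `QueryRequest`, `StudyPolicy`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List, Tuple


class Decision(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    PARTIAL = "partially_approved"


@dataclass(frozen=True)
class QueryRequest:
    researcher_id: str
    authenticated: bool       # assumed to be set after institutional IdP login
    domains: FrozenSet[str]   # domains the query touches
    wants_export: bool = False


@dataclass(frozen=True)
class StudyPolicy:
    allowed_domains: FrozenSet[str]
    export_allowed: bool      # stands in for a DUA export restriction


def evaluate(request: QueryRequest, policy: StudyPolicy,
             audit_log: List[Dict[str, object]]) -> Tuple[Decision, List[str]]:
    restrictions: List[str] = []
    permitted = request.domains & policy.allowed_domains
    if not request.authenticated or not permitted:
        decision = Decision.DENIED
    else:
        dropped = request.domains - permitted
        if dropped:
            restrictions.append("domains removed: " + ", ".join(sorted(dropped)))
        if request.wants_export and not policy.export_allowed:
            restrictions.append("export blocked by DUA")
        decision = Decision.PARTIAL if restrictions else Decision.APPROVED
    # Every evaluation is recorded, whatever the outcome.
    audit_log.append({
        "researcher": request.researcher_id,
        "domains": sorted(request.domains),
        "decision": decision.value,
        "restrictions": list(restrictions),
    })
    return decision, restrictions


audit_log: List[Dict[str, object]] = []
policy = StudyPolicy(frozenset({"scales", "digital_phenotype"}), export_allowed=False)
request = QueryRequest("r-042", True, frozenset({"scales", "genetics"}))

decision, restrictions = evaluate(request, policy, audit_log)
print(decision.value, restrictions)
```

Writing the audit entry inside the evaluation function, instead of leaving it to the caller, makes it hard to run a query without leaving a trace.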
Consent Granularity
MindCODE supports consent directives at the domain level, optionally qualified by de-identification or a named recipient. A patient might consent to:
- Sharing scale scores and clinical notes with their treatment team (Domain 5 plus clinical records)
- Sharing de-identified scale scores and digital phenotype data for research (Domains 4 and 5, de-identified)
- Excluding genomic data from all external sharing (Domain 1 restricted)
- Sharing brain imaging with a specific multi-site research consortium (Domain 3, named recipient)
The CLR respects these directives at query time. A researcher querying cross-domain features will see results only for patients whose consent covers every domain involved in that feature.
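The "consent must cover all domains" rule amounts to a subset check per patient. The sketch below assumes a simple directive shape with a purpose, a set of domain numbers, and an optional named recipient; it ignores de-identification, and `ConsentDirective`, `covered_domains`, `eligible_patients`, and the example patients are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class ConsentDirective:
    purpose: str                     # e.g. "treatment" or "research"
    domains: FrozenSet[int]          # domain numbers 1-6 as used in this post
    recipient: Optional[str] = None  # None means no named-recipient restriction


def covered_domains(directives: List[ConsentDirective], purpose: str, recipient: str) -> Set[int]:
    """Union of domains the patient has released for this purpose and recipient."""
    covered: Set[int] = set()
    for directive in directives:
        if directive.purpose == purpose and directive.recipient in (None, recipient):
            covered |= directive.domains
    return covered


def eligible_patients(consents: Dict[str, List[ConsentDirective]],
                      required: FrozenSet[int], purpose: str, recipient: str) -> List[str]:
    """Patients whose consent covers every domain the feature needs."""
    return sorted(
        pid for pid, directives in consents.items()
        if required <= covered_domains(directives, purpose, recipient)
    )


# Invented consent records.
consents = {
    "P-1": [ConsentDirective("research", frozenset({4, 5}))],
    "P-2": [ConsentDirective("research", frozenset({5}))],
    "P-3": [ConsentDirective("research", frozenset({4, 5})),
            ConsentDirective("research", frozenset({3}), recipient="consortium-x")],
}

print(eligible_patients(consents, frozenset({4, 5}), "research", "lab-a"))
print(eligible_patients(consents, frozenset({3, 5}), "research", "consortium-x"))
```

A patient who has released only one of the two domains behind a cross-domain feature is left out of the result entirely, never partially included.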
Building on the Data Cloud
The Data Cloud and CLR are the foundation for everything else in MindCODE: the Virtual Human system reasons over CLRs, the Clinical Intelligence Suite computes treatment response predictions from CLR features, and the research tools enable cohort analyses across thousands of CLRs.
In the next post, we will explain how the Virtual Human system uses the CLR to build comprehensive patient models that support clinician decision-making — without replacing clinical judgment.