[ 005 ] brain alignment · BAMS v2.1 · phase 15–16

not a metaphor. a benchmark.

because agidb uses the same encoder stack as TRIBE v2 (Meta FAIR, March 2026 · the brain-encoding foundation model that won Algonauts 2025), its internal HDC signatures can be benchmarked directly against TRIBE-predicted cortical activations across 720 fMRI subjects. that benchmark is BAMS — representational similarity analysis across six functional networks.

[figure: cortical flatmap · Schaefer 1000-parcel atlas · 6 networks · TRIBE v2 · n=720 · OOD r̄=0.215]

RSA correlation · agidb signatures × TRIBE-predicted BOLD:
DMN (default mode) r=0.72 · dorsal attention r=0.64 · frontoparietal r=0.81 · somatomotor r=0.41 · visual r=0.88 · ventral attention r=0.55

color scale: r < 0.3 · 0.3–0.6 · > 0.6 · diag mean 0.67

θ_brain · brain-calibrated surprise gate · phase 15 calibration

surprise(t) = 1 − ham_sim( s(t), bundle( s[t-K..t] ) )

fit against TRIBE-predicted neural surprise · associative cortex (TPJ · dlPFC · DMN)
θ_brain = 0.52

inheriting from:
TRIBE v2 · Meta FAIR · arxiv 2507.22229 · mar '26 · CC-BY-NC · 720 subjects · 70k voxel-level
V-JEPA 2 · arxiv 2506.09985 · Gigantic-256 · 1.2B params
Charikar '02 · STOC · similarity estimation via rounding · JL preservation
Algonauts 2025 · TRIBE v1 · 1st of 263 teams

agidb — Brain Alignment (v2.1+)

This document gives the full technical detail of how agidb v2.1 integrates Meta FAIR’s V-JEPA 2, Wav2Vec-BERT, and Llama-3.2-3B sensory encoders, projects their latents to 8192-bit HDC signatures, binds them into multimodal episodes via VSA, and calibrates surprise gating against TRIBE v2 brain-encoding ground truth.

Status: v2.1 milestone, target month 12 (aug 2026). Gated on v2.0 decision gate “Commit” outcome at week 12.

What brain-alignment is and isn’t

What it is:

An empirical methodology for evaluating agent memory representations against human cortical activation patterns predicted by TRIBE v2 across 720 subjects on naturalistic movies.
A measurement-grounded calibration of agidb’s sensory surprise threshold.
A multimodal sensory pipeline using the same encoder stack as TRIBE v2 (V-JEPA 2 video, Wav2Vec-BERT audio, Llama-3.2-3B text) so the comparison is meaningful.
Constitution article XVIII.

What it isn’t:

A claim that agidb “thinks like a brain” (it doesn’t).
A brain-decoding service (we don’t decode user brains).
A replacement for the cognitive primitives (goals/beliefs/self-model still ship in v2.0 first).
A change to the core HDC substrate (still 8192-bit BSC, still bind/bundle/hamming).

Why this matters

Three reasons.

1. agidb gains a unique evaluation axis. Existing agent memory benchmarks (LongMemEval, LoCoMo, BEAM, PrefEval) measure downstream QA accuracy. None measure whether the memory’s internal representations resemble human memory. TRIBE v2 (Meta FAIR, March 2026) made brain-aligned evaluation tractable for the first time by releasing open weights for a foundation model predicting fMRI BOLD across 720 subjects from V-JEPA 2 + Wav2Vec-BERT + Llama-3.2-3B. agidb v2.1 inherits this evaluation surface.

2. Surprise threshold gets a defensible value. v2.0’s surprise threshold is a magic number (default 0.4). v2.1 calibrates it against neural surprise predicted by TRIBE v2 on associative cortex. This is publishable methodology, not a guess.

3. Multimodal episodes via VSA are factorable in a way attention fusion is not. TRIBE v2 fuses modalities via attention into a dense hidden state — once fused, the components are not separately recoverable. agidb fuses via VSA role-filler binding (XOR) — any modality can be recovered from a stored episode signature by XORing with the appropriate role hypervector. This is the structural advantage over both TRIBE (attention) and mem0/letta/zep (dense embeddings).

TRIBE v2 — what to know

TRIBE v2 was released by Meta FAIR on March 26, 2026. Paper: arxiv 2507.22229 (v1) and the v2 technical report. Weights: huggingface.co/facebook/tribev2 under CC BY-NC. Code: github.com/facebookresearch/tribev2.

What it is: a tri-modal foundation model predicting fMRI BOLD responses to naturalistic stimuli. Won Algonauts 2025 (first place out of 263 teams) as TRIBE v1. v2 scales to ~70k voxel-level predictions across 720 subjects.

Architecture (v1, preserved in v2):

Three frozen modality encoders producing per-time-step features resampled to a common 2 Hz grid:
- Text: Llama-3.2-3B, 1024 preceding words context, 2048-d output
- Audio: Wav2Vec-BERT 2.0, 60s chunks resampled 50→2 Hz, 1024-d output
- Video: V-JEPA 2 Gigantic-256, 64 frames over preceding 4s per 2Hz bin, 1280-d output (spatially averaged)
Each modality projected linear+layernorm to shared dim 1024, concatenated → 3×1024 per timestep
Temporal transformer: 8 layers, 8 attention heads, hidden 3072. Context window 100 TRs (~149s) with 10s jitter
Per-subject personalization: (a) learnable subject embedding added to input, (b) subject-specific linear head at output
Modality dropout p=0.2 during training (randomly zeroes one modality)
Ensemble of 1000 models with varied seeds, losses, layer aggregations
Per-parcel softmax over validation pearson with T=0.3 picks ensemble weights
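The per-parcel ensemble weighting in the last item can be sketched as a temperature softmax. This is a minimal illustration, assuming each model's validation Pearson score for one parcel is already available. The function names `ensemble_weights` and `ensemble_predict` are hypothetical and are not part of TRIBE's code.

```rust
/// Softmax over per-model validation scores with temperature `t`.
/// Higher-scoring models receive larger weights; the weights sum to 1.
pub fn ensemble_weights(val_pearson: &[f32], t: f32) -> Vec<f32> {
    // Subtract the maximum before exponentiating for numerical stability.
    let max = val_pearson.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = val_pearson.iter().map(|&r| ((r - max) / t).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Weighted average of per-model predictions for one parcel at one time step.
pub fn ensemble_predict(preds: &[f32], weights: &[f32]) -> f32 {
    preds.iter().zip(weights).map(|(p, w)| p * w).sum()
}
```

A lower temperature concentrates the weight on the best-scoring models; a higher one moves the ensemble toward a plain average.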

Training data:

451.6 hours of fMRI training from 25 subjects (movies, podcasts, silent videos)
Evaluated on 1117.7 hours across 720 subjects (including HCP 7T)
Schaefer 1000-parcel atlas (v1), ~70k cortical surface vertices (v2)

v1 results on Algonauts 2025 OOD:

Mean OOD pearson r = 0.2146 (recovers ~54% of noise ceiling)
Beat the next two teams (VIBE 0.2096, SDA 0.2094) by tight margins, decided by ensembling sophistication

Critical assessment:

TRIBE predicts BOLD (a slow hemodynamic proxy lagged ~5s behind neural activity), not neural firing or cognition itself.
Noise ceiling caps achievable correlation around r ≈ 0.4 on naturalistic movies.
The “70× resolution” headline involves a tradeoff between resolution and per-target noise.
“AlphaFold for neuroscience” is influencer framing, not Meta’s official claim. Meta uses “in-silico neuroscience” / “digital twin of neural activity.” TRIBE is more accurately “BERT for fMRI” — a real foundation model, not a paradigm shift.

Why agidb uses it: TRIBE v2 is the best available source of cortical ground truth on naturalistic stimuli. Using it as a benchmark target is well-founded; pretending its predictions are “the brain” is not. We use it for evaluation, not as a feature.

The encoder stack

agidb v2.1 uses the same three frozen encoders as TRIBE v2, so that both systems start from the same input representations and the benchmark comparison is meaningful.

V-JEPA 2 — video encoder

Repo: github.com/facebookresearch/vjepa2
Paper: arxiv 2506.09985
Weights: huggingface.co/facebook/vjepa2-gigantic-256
License: CC BY-NC
Size: 1.2B parameters
Input: 64 frames at 256×256, 2-frame tubelets
Output: 8192 patch tokens × 1024-d embeddings per clip (the token count happens to equal the 8192-bit signature width; the tokens are pooled before projection)
Backbone: ViT with 3D rotary position embeddings (3D-RoPE)
Training: self-supervised on 1M+ hours of internet video. EMA target network prevents collapse.
Benchmarks: SSv2 77.3% top-1, Epic-Kitchens-100 39.7 R@5

For agidb:

Take the 64-frame, 256×256 video window. Run V-JEPA 2 encoder. Get 8192 × 1024 tokens.
Spatially average to a single 1024-d vector per clip (matches TRIBE’s pooling).
Project to 8192-bit HDC signature via Charikar 2002.
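The token count follows from the input shape: 64 frames in 2-frame tubelets give 32 temporal positions, and a 256×256 frame gives 16×16 spatial patches if the patch size is 16 pixels (an assumption, not stated above), for 32 × 256 = 8192 tokens. The pooling step can then be sketched as below, assuming the tokens arrive as one row-major `n_tokens × dim` buffer. `mean_pool` is a hypothetical helper name.

```rust
/// Mean-pool a token sequence (n_tokens × dim, row-major) into one `dim`-d vector.
pub fn mean_pool(tokens: &[f32], dim: usize) -> Vec<f32> {
    assert!(dim > 0 && tokens.len() % dim == 0, "tokens must be n × dim");
    let n = tokens.len() / dim;
    let mut out = vec![0.0f32; dim];
    // Accumulate each token row into the running sum.
    for row in tokens.chunks_exact(dim) {
        for (o, &v) in out.iter_mut().zip(row) {
            *o += v;
        }
    }
    // Divide by the token count; an empty input stays all zeros.
    if n > 0 {
        for o in out.iter_mut() {
            *o /= n as f32;
        }
    }
    out
}
```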

Inference cost:

CPU (Apple M2, i7-12700H): ~1.5s per 64-frame clip
GPU (M2 ANE, RTX 4090): ~200ms per clip

Wav2Vec-BERT 2.0 — audio encoder

Paper: Meta SSL audio 2024
Weights: huggingface.co/facebook/w2v-bert-2.0
License: CC BY-NC
Input: 60s audio chunk at 16kHz
Output: ~50 Hz frame-level latents at 1024-d
Training: self-supervised on multilingual audio

For agidb:

Take 60s audio window. Run W2V-BERT encoder. Get frame-level 1024-d latents.
Temporally mean-pool to a single 1024-d vector per clip (matches TRIBE).
Project to 8192-bit HDC signature.

Inference cost:

CPU: ~400ms per 60s clip
GPU: ~80ms per clip

Llama-3.2-3B — text encoder

Weights: huggingface.co/meta-llama/Llama-3.2-3B
License: Llama 3.2 community license (commercial use OK with attribution)
Input: up to 1024 tokens preceding context
Output: final-layer hidden state at 3072-d (last token); for compact storage use the final-layer mean-pooled hidden state reduced to 2048-d

For agidb:

Take text window. Tokenize. Run Llama-3.2-3B (encoder usage = forward pass, no generation).
Extract the final-layer mean-pooled hidden state.
Project to 8192-bit HDC signature.

Inference cost:

CPU: ~200ms per 1024-token window
GPU: ~30ms per window

Why Llama-3.2-3B and not something larger: TRIBE v2 uses Llama-3.2-3B. Matching means alignment. Larger models (8B, 70B) would be wasteful for feature extraction and break the comparison.

HDC projection — Charikar 2002

Each encoder produces a dense latent. agidb projects to 8192-bit signatures via thresholded random projection:

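/// Thresholded random projection (sign random projection, Charikar 2002).
/// `HV`, `D_INPUT`, and `SCALE` are defined elsewhere in the crate.
pub struct HDCProjector {
    // ±1 entries, seeded. Boxed because 8192 × D_INPUT bytes is too large for the stack.
    matrix: Box<[[i8; D_INPUT]; 8192]>,
    bias: [i32; 8192], // optional, often zero
}

impl HDCProjector {
    pub fn project(&self, x: &[f32; D_INPUT]) -> HV {
        let mut sig = HV::zero();
        for (bit_idx, row) in self.matrix.iter().enumerate() {
            // Fixed-point dot product of one ±1 row with the scaled latent.
            let acc: i32 = row
                .iter()
                .zip(x.iter())
                .map(|(&m, &v)| (m as i32) * ((v * SCALE) as i32))
                .sum();
            // One output bit per row: set when the projection clears the bias.
            if acc > self.bias[bit_idx] {
                sig.set_bit(bit_idx);
            }
        }
        sig
    }
}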
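For experimenting with the projection outside the crate, a dependency-free sketch follows. It is not the agidb implementation: `SignProjector`, the `SplitMix64` generator, and the packed `Vec<u64>` signature are all assumptions made so the example is self-contained, and it accumulates in `f32` with a zero bias rather than fixed-point.

```rust
/// SplitMix64: a tiny deterministic PRNG, used only to keep the sketch dependency-free.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

pub const BITS: usize = 8192;

/// Seeded ±1 projection matrix, stored row-major as BITS × d_input signs.
pub struct SignProjector {
    d_input: usize,
    signs: Vec<i8>,
}

impl SignProjector {
    /// Same seed and input width always produce the same matrix.
    pub fn new(seed: u64, d_input: usize) -> Self {
        let mut rng = SplitMix64(seed);
        let signs: Vec<i8> = (0..BITS * d_input)
            .map(|_| if rng.next_u64() & 1 == 1 { 1i8 } else { -1i8 })
            .collect();
        Self { d_input, signs }
    }

    /// Returns the signature packed into 128 u64 words; bit i is set when row_i · x > 0.
    pub fn project(&self, x: &[f32]) -> Vec<u64> {
        assert_eq!(x.len(), self.d_input);
        let mut sig = vec![0u64; BITS / 64];
        for (i, row) in self.signs.chunks_exact(self.d_input).enumerate() {
            let acc: f32 = row.iter().zip(x).map(|(&s, &v)| s as f32 * v).sum();
            if acc > 0.0 {
                sig[i / 64] |= 1u64 << (i % 64);
            }
        }
        sig
    }
}

/// Number of differing bits between two packed signatures.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Projecting a latent, a slightly perturbed copy, and its negation should give hamming distances near zero, small, and near 8192 respectively, which is the distance-ordering property the pipeline relies on.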

Why this works:

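Angle preservation. For a random projection matrix R with k rows, the expected fraction of bits that differ between sign(Rx) and sign(Ry) is θ/π, where θ is the angle between x and y, so hamming distance tracks angular (and hence cosine) distance. Charikar 2002, “Similarity Estimation Techniques from Rounding Algorithms”, proved this for Gaussian random hyperplanes; ±1 entries behave the same to a good approximation for dense high-dimensional latents. The estimate concentrates in the manner of a Johnson-Lindenstrauss bound: ε-distortion needs k = O(log n / ε²) bits, so 8192 bits is more than enough for our scales.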
Deterministic. Fixed seed → reproducible. Same input → same signature.
Training-free. No learned parameters. Survives encoder version changes (just regenerate projection matrix).
Fast. Multiply-add of 1024 or 2048 entries per bit. SIMD-friendly.
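The first property above can be written out explicitly. The statement below is the standard result for a Gaussian random hyperplane r; applying it to the ±1 matrix used here is an approximation, and treating the k bits as independent is an assumption.

```latex
\[
\Pr\big[\operatorname{sign}(r \cdot x) \neq \operatorname{sign}(r \cdot y)\big]
  = \frac{\theta(x, y)}{\pi},
\qquad
\theta(x, y) = \arccos\frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
\]
\[
\mathbb{E}\!\left[\frac{d_H}{k}\right] = \frac{\theta}{\pi},
\qquad
\operatorname{sd}\!\left(\frac{d_H}{k}\right)
  = \sqrt{\frac{p(1-p)}{k}} \le \frac{1}{2\sqrt{k}} \approx 0.0055
\quad (k = 8192,\; p = \theta/\pi)
\]
```

Here d_H is the hamming distance between the two signatures. At k = 8192 the normalized hamming distance therefore estimates θ/π to within roughly half a percentage point (one standard deviation).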

Why not alternatives:

Learned quantization (small MLP, sign-quantize output): could optimize for downstream tasks but adds a training dependency. Locked out by article XVIII clause 5 in v2.1; revisit in v2.2 only if BAMS plateaus.
Thermometer coding (per-dim ordinal binning): less expressive for high-dim semantic embeddings. Use only for scalar sensor channels in v2.3.
Sparse Binary Distributed Representation (SBDR, Kanerva sparse codes ~2% density): matches biological sparsity, large capacity advantage for associative memory. Invasive to migrate from BSC. Consider for v2.5 substrate evolution.

Projection matrix versioning:

Each encoder gets a deterministic seeded projection matrix.
Matrix seeds stored in manifest.toml.
Encoder version + projection seed = reproducibility.
Encoder upgrade requires re-projection of old episodes (deferred, optional).

VSA multimodal binding

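Multimodal episodes are built in two steps: each component is bound to its role hypervector via XOR, and the bound pairs are then bundled (bitwise majority vote) into a single 8192-bit episode signature:

pub fn bind_multimodal_episode(
    sig_video: Option<HV>,
    sig_audio: Option<HV>,
    sig_text: Option<HV>,
    goal_id: Option<GoalId>,
    belief_ids: &[BeliefId],
    time_bucket: TimeBucket,
) -> HV {
    // Bind each present component to its role (XOR) and collect the bound pairs.
    let mut parts: Vec<HV> = Vec::new();

    if let Some(sv) = sig_video {
        parts.push(ROLE_VIDEO.bind(&sv));
    }
    if let Some(sa) = sig_audio {
        parts.push(ROLE_AUDIO.bind(&sa));
    }
    if let Some(st) = sig_text {
        parts.push(ROLE_TEXT.bind(&st));
    }
    if let Some(g) = goal_id {
        parts.push(ROLE_GOAL.bind(&goal_signature(g)));
    }
    for b in belief_ids {
        parts.push(ROLE_BELIEF.bind(&belief_signature(*b)));
    }
    parts.push(ROLE_TIME.bind(&time_signature(time_bucket)));

    // Superpose by majority vote, not XOR: XOR-ing the bound pairs together would
    // leave every component indistinguishable from noise after unbinding.
    // `bundle` is assumed to be the substrate's majority-vote operation over a slice.
    bundle(&parts)
}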

ROLE_* are fixed random 8192-bit hypervectors seeded at workspace init.

Factorability — the key property:

pub fn extract_audio_signature(episode: &HV) -> HV {
    episode.bind(&ROLE_AUDIO)  // XOR with role HV → recovers approximately sig_audio
}

The recovered signature is an approximation (noise from bundling other modalities), cleaned up by nearest-neighbor search against the audio-signature codebook. This is the standard VSA unbind-and-cleanup pattern.
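The unbind-and-cleanup pattern can be tried end to end with a small standalone sketch. It assumes a binary spatter code with XOR binding and majority-vote bundling; the `Hv` type, the `SplitMix64` generator, and all function names are hypothetical stand-ins, not the agidb API.

```rust
const WORDS: usize = 8192 / 64;
type Hv = [u64; WORDS];

/// SplitMix64: a tiny deterministic PRNG, used only to keep the sketch dependency-free.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn random_hv(rng: &mut SplitMix64) -> Hv {
    let mut hv = [0u64; WORDS];
    for w in hv.iter_mut() {
        *w = rng.next_u64();
    }
    hv
}

/// Binding: bitwise XOR. Self-inverse, so binding twice with the same role is the identity.
fn bind(a: &Hv, b: &Hv) -> Hv {
    let mut out = [0u64; WORDS];
    for i in 0..WORDS {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Bundling: per-bit majority vote. Ties (possible with an even count) resolve to 0.
fn bundle(items: &[Hv]) -> Hv {
    let mut out = [0u64; WORDS];
    for bit in 0..8192 {
        let (w, mask) = (bit / 64, 1u64 << (bit % 64));
        let ones = items.iter().filter(|hv| hv[w] & mask != 0).count();
        if 2 * ones > items.len() {
            out[w] |= mask;
        }
    }
    out
}

fn hamming(a: &Hv, b: &Hv) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Cleanup: index of the codebook entry nearest to a noisy signature.
fn cleanup(noisy: &Hv, codebook: &[Hv]) -> Option<usize> {
    (0..codebook.len()).min_by_key(|&i| hamming(noisy, &codebook[i]))
}
```

With three bundled role-filler pairs, the unbound signature sits about 25% of bits away from the stored filler, against about 50% for unrelated signatures, which is why nearest-neighbor cleanup succeeds. The margin shrinks as more pairs are bundled into one episode.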

Why factorability matters:

TRIBE v2 fuses via attention into a dense hidden state. You cannot recover the original audio from the fused state — the fusion is lossy and entangled.
agidb binds each modality to its role via XOR and bundles the bound pairs. Audio is recoverable up to bundling noise.
This enables: querying “show me episodes where the audio sounded like X” by binding ROLE_AUDIO with query audio and finding nearest stored episodes that produce a clean audio signature when unbound.
Also enables: ablation studies, debugging, attribution. You can ask “what was the audio component of this episode’s signature?” and answer it.

Brain-calibrated surprise gating

v2.0’s surprise threshold is a magic number. v2.1’s is empirically fit.

The calibration protocol

1. SELECT a paired stimulus dataset (movie clips with available TRIBE-aligned fMRI ground truth)
2. For each clip at each TR (1.49s window):
   a. Compute TRIBE v2 predicted BOLD across associative cortex parcels
      (TPJ, dlPFC, DMN regions in Schaefer 1000 atlas)
   b. Compute neural_surprise(t) = || BOLD_pred(t) - sliding_mean(BOLD_pred, ±5 TRs) ||
   c. Compute agidb signature for same clip via observe_multimodal pipeline
   d. Compute agidb_surprise(t) = 1 - hamming_sim(sig(t), bundle(sigs[t-K..t]))
3. FIT threshold θ_brain to maximize Pearson correlation between:
   - Indicator(agidb_surprise(t) > θ_brain)
   - Indicator(neural_surprise(t) > σ × mean_neural_surprise)
   where σ ∈ {1.5, 2.0, 2.5} is the neural threshold sweep
4. PUBLISH calibrated θ_brain with reproduction kit
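Step 3 of the protocol can be sketched as a grid search over candidate thresholds. This is an illustration under stated assumptions: both surprise series are already computed and aligned per TR, the candidate grid is supplied by the caller, and `fit_theta`, `indicator`, and `pearson` are hypothetical names rather than the `calibrate.rs` API.

```rust
/// Pearson correlation; `None` when lengths differ or either series is constant.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    if sxx == 0.0 || syy == 0.0 {
        None
    } else {
        Some(sxy / (sxx * syy).sqrt())
    }
}

/// 1.0 where the series exceeds the threshold, 0.0 elsewhere.
fn indicator(xs: &[f64], threshold: f64) -> Vec<f64> {
    xs.iter().map(|&v| if v > threshold { 1.0 } else { 0.0 }).collect()
}

/// Returns the (theta, correlation) pair with the highest correlation between
/// agidb surprise events and neural surprise events, for one value of sigma.
pub fn fit_theta(agidb: &[f64], neural: &[f64], sigma: f64, grid: &[f64]) -> Option<(f64, f64)> {
    if neural.is_empty() {
        return None;
    }
    let mean = neural.iter().sum::<f64>() / neural.len() as f64;
    let neural_events = indicator(neural, sigma * mean);
    grid.iter()
        // Thresholds that yield a constant indicator have no defined correlation; skip them.
        .filter_map(|&theta| pearson(&indicator(agidb, theta), &neural_events).map(|r| (theta, r)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Running `fit_theta` once per σ in the sweep {1.5, 2.0, 2.5} gives one candidate θ_brain per neural threshold. How those are combined into the published value is not specified above.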

Where the calibration data comes from

Courtois NeuroMod: 6 subjects, ~80h each of naturalistic movies (Friends seasons 1-7, four feature films). Open access. The training data for TRIBE.
Algonauts 2025 held-out movies: 6 OOD films (Pulp Fiction, Princess Mononoke, Passe-Partout, World of Tomorrow, Planet Earth, Charlie Chaplin). TRIBE v2 has predicted BOLD here.
HCP 7T: higher-resolution but smaller naturalistic stimulus set.

For v2.1 ship: calibrate on a single representative subject from Courtois NeuroMod, validate on Algonauts OOD held-outs. Document the protocol so users can recalibrate against their own ground truth.

The expected outcome

θ_brain ≈ 0.45-0.55, slightly higher than v2.0’s default 0.4. This makes sensory promotion more selective, closer to the filtering implied by TRIBE-predicted neural surprise. We expect this to raise the BAMS score, because the resulting episodes will concentrate on high-saliency moments that match human attentional patterns.

What we don’t claim

We don’t claim agidb’s surprise threshold “matches the human brain.” We claim it correlates with neural surprise predicted by TRIBE v2 on associative cortex.
We don’t claim brain-calibrated surprise will improve downstream agent task performance unconditionally. We claim it’s a measurement-grounded default that’s defensible in papers and reproducible.
The calibration is bounded by TRIBE v2’s own accuracy (its predictions recover ~54% of the noise ceiling). agidb-derived surprise can’t be more brain-aligned than TRIBE’s predictions are themselves accurate.

Implementation plan

Phase 14 — Multimodal sensory encoders (weeks 37-42)

Goal: end-to-end pipeline from raw video+audio+text to 8192-bit episode HV.

Deliverables:

agidb-sensory::vjepa.rs — V-JEPA 2 ONNX runtime wrapper, 64-frame video → 1024d
agidb-sensory::wav2vec.rs — Wav2Vec-BERT 2.0 wrapper, 60s audio → 1024d
agidb-sensory::llama.rs — Llama-3.2-3B wrapper, 1024-token text → 2048d
agidb-sensory::project.rs — Charikar 2002 thresholded random projection
agidb-sensory::multimodal.rs — VSA role-filler binding + unbinding API
AgiDb::observe_multimodal() API extension to agidb-core
ONNX backend by default; Candle backend as optional pure-Rust path
Property tests: projection preserves distance ordering (inputs closer in cosine distance map to signatures closer in hamming distance); bind-then-unbind recovers signatures with low hamming noise

Exit criterion: 30s video+audio clip → encoder inference → projection → binding → stored episode HV. P50 latency ≤ 2s on a laptop CPU.

Phase 15 — Brain-calibrated surprise gating (weeks 43-46)

Goal: empirically calibrate θ_brain against TRIBE v2 predicted neural surprise.

Deliverables:

TRIBE v2 inference wrapper (calls the Python implementation via PyO3 for simplicity in v2.1; native Rust port later)
Calibration protocol implementation in agidb-sensory::calibrate.rs
Calibration script + dataset documentation (Courtois NeuroMod open access)
manifest.toml entry for calibrated θ_brain with provenance (calibration dataset, TRIBE v2 version, fit date)
Comparison plot: pre-calibration vs post-calibration sensory promotion patterns on a held-out movie

Exit criterion: calibrated θ_brain ships in v2.1. Documentation includes reproducible recipe. Calibration runs in CI nightly against fixed reference.

Phase 16 — BAMS benchmark suite (weeks 47-52)

Goal: ship the brain-aligned memory similarity benchmark, baselines, and ICLR 2026 paper. See BAMS_BENCHMARK.md for the full protocol.

Open questions for v2.2+

Can a learned projection beat random projection on BAMS? Article XVIII clause 5 locks this out in v2.1; revisit if BAMS plateaus.
Should the encoder stack evolve to V-JEPA 3 / TRIBE v3 when those land? Likely yes, but recalibration cost is non-trivial.
Can BCI input (Brain-JEPA, signal-JEPA) work as another sensory modality? Speculative. v2.4 territory.
Should agidb ship its own brain-encoder? No. Out of scope per article XII. Use TRIBE v2 as published.
Can BAMS be extended to non-naturalistic stimuli? Yes, but requires fMRI ground truth for the target stimulus class. Currently movies + podcasts + silent videos cover most generic content.

Operational notes

GPU is helpful but not required. v2.1 ships CPU-first. V-JEPA 2 on CPU takes ~1.5s per 64-frame clip; acceptable for most agent workloads where multimodal observations happen seconds-to-minutes apart, not per-frame.

Encoder weights are downloaded on first use, not bundled. The manifest pins the HuggingFace SHA. This keeps the binary small (~100MB without weights, ~4GB with).

Encoder versions are pinned per database. A database created with V-JEPA 2 Gigantic-256 weights at hash X cannot be opened by a binary using hash Y unless re-projection is run. Documented in the migration guide.

Brain-calibrated surprise is one-shot per database. Set at database creation time from the global calibrated default. Users can recalibrate against their own fMRI data if they have any; documented but not required.

ONNX vs Candle backend. ONNX is the default (broadest hardware support). Candle is the experimental pure-Rust path for environments where ONNX runtime is unavailable (some embedded targets, WASM). Identical outputs to within numerical noise.

What this gets us, in one paragraph

agidb v2.1 is the first agent memory substrate to ship with brain-aligned multimodal sensory encoding using the same encoder stack as Meta FAIR’s TRIBE v2 brain-encoding foundation model, with surprise gating calibrated against neural surprise predicted by TRIBE v2 (a model evaluated across 720 fMRI subjects), and a published benchmark (BAMS) measuring representational similarity to predicted human cortical activations across six functional networks. None of the funded agent-memory competitors (mem0, letta, zep, cognee, supermemory) have published anything comparable. This is the paper-sized contribution that turns agidb from “another rust memory library” into “an artifact of brain-aligned cognitive science research with production rust deployment.”

agidb — BAMS Benchmark (v2.1)

The brain-aligned memory similarity benchmark. Protocol, baselines, implementation plan, paper plan. The first published evaluation of agent memory systems against TRIBE-derived cortical ground truth.

Status: v2.1 milestone, phase 16 (weeks 47-52, ~aug 2026). Gated on phase 14 (multimodal encoders) and phase 15 (brain-calibrated surprise) completing.

Target venue: ICLR 2026 MemAgents workshop. Backup: CCN 2026. Stretch: NeurIPS 2026 main.

What BAMS is

BAMS = Brain-Aligned Memory Similarity.

A benchmark suite that scores agent memory systems by how well their internal representations align with predicted human cortical activations on matched naturalistic stimuli, measured via representational similarity analysis (RSA) across six functional cortical networks.

What it measures

Given a stimulus stream (a movie clip with audio), at each TR (~1.5s window):

TRIBE v2 predicts cortical activation patterns across ~70k voxels for an average human watching that stimulus.
The agent memory system under test processes the same stimulus and produces an internal representation (an episode signature for agidb; a vector embedding for mem0/letta/zep; raw V-JEPA latents for the unprocessed-encoder baseline).
RSA compares the structural similarity of the two representation spaces over many TR pairs.

Score = mean Pearson correlation between the upper triangles of the TRIBE-derived representational dissimilarity matrix (RDM) and the agent’s RDM, computed per functional cortical network and averaged.

Why this is novel

Existing agent memory benchmarks fall into three categories:

Category	Examples	Measures
Long-context QA	LongMemEval-S, BEAM	retrieval accuracy on synthetic long-context Q&A
Multi-session conversation	LoCoMo, PrefEval	memory consistency across sessions
Personalization	Mem0 internal, Hindsight	preference learning + recall

None measure cognitive plausibility. Whether the memory’s internal representations resemble how human memory organizes the same information has not been evaluated for any production agent memory system. BAMS fills this gap.

Why now

Three converging conditions:

TRIBE v2 made it tractable. Before March 2026, you couldn’t get well-validated cortical predictions on arbitrary naturalistic stimuli. TRIBE v2 changed that.
RSA is the right comparison method. Kriegeskorte et al. 2008 established RSA as the standard way to compare representations across systems (brains, models, behavior). The technique is well-understood and widely accepted.
Agent memory is an established category, but its evaluations are saturating. LongMemEval and LoCoMo scores are crowding above 90%. The field needs a new axis, and cognitive plausibility is a defensible one with empirical grounding.

The protocol

Input

Stimulus dataset: 6 held-out naturalistic movies from Algonauts 2025 OOD set (Pulp Fiction, Princess Mononoke, Passe-Partout, World of Tomorrow, Planet Earth, Charlie Chaplin). Total ~6 hours. Public datasets accessible via Courtois NeuroMod / Algonauts pipeline.
TR resolution: 1.49s (matches Courtois NeuroMod fMRI sampling).
Stimulus features: video at 256×256 with 64-frame windows, audio at 16kHz with 60s windows, text (transcripts/captions where available).
Ground truth: TRIBE v2 predicted BOLD across 1000 Schaefer parcels (v1 mode) or ~70k cortical surface vertices (v2 mode).

Procedure

Step 1 — Compute TRIBE-derived RDMs (offline, one-time).

For each of 6 movies, for each TR t:

Run TRIBE v2 over the (video, audio, text) stream at time t → predicted BOLD per parcel/voxel.
For each of 6 functional cortical networks (DMN, visual, auditory, language, dorsal attention, frontoparietal), extract the predicted activation pattern over parcels assigned to that network.

For each network, compute the RDM:

RDM_brain[i][j] = 1 - pearson(activation_pattern[t_i], activation_pattern[t_j])

This gives 6 RDMs per movie, one per cortical network. Total over the suite: 36 RDMs.
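The per-network RDM computation can be sketched as follows, assuming the predicted activation pattern for each TR is already restricted to the parcels of one network. `rdm` and `pearson` are hypothetical helper names.

```rust
/// Pearson correlation between two equal-length patterns.
/// Returns NaN if either pattern is constant (correlation undefined).
pub fn pearson(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    sxy / (sxx * syy).sqrt()
}

/// RDM[i][j] = 1 - pearson(pattern[t_i], pattern[t_j]); symmetric with a zero diagonal.
/// `patterns[t]` is the activation pattern over one network's parcels at TR t.
pub fn rdm(patterns: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = patterns.len();
    let mut m = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in (i + 1)..n {
            let d = 1.0 - pearson(&patterns[i], &patterns[j]);
            m[i][j] = d;
            m[j][i] = d;
        }
    }
    m
}
```

Entries range from 0 (perfectly correlated patterns) to 2 (perfectly anti-correlated patterns).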

Step 2 — Compute agent memory RDMs (per system being evaluated).

For each movie, replay the stimulus stream to the agent memory system. At each TR boundary, capture the agent’s internal representation of “what has been observed so far.” For agidb, this is the most recent episode signature (8192 bits). For mem0/letta/zep, this is the most recent stored embedding or the bundle of recent embeddings.

Compute the RDM:

RDM_agent[i][j] = distance(repr[t_i], repr[t_j])

Distance metric per system:

agidb (binary HV): hamming distance / 8192
raw V-JEPA / dense embeddings: cosine distance
HippoRAG (graph): structural distance on retrieved subgraph
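The first two metrics can be sketched directly; the graph-structural distance is system-specific and is not shown. The sketch assumes binary signatures are packed into `u64` words and dense embeddings are `f32` slices. Both function names are hypothetical.

```rust
/// Normalized hamming distance between two packed binary signatures, in [0, 1].
/// For 8192-bit signatures this is hamming distance / 8192.
pub fn hamming_distance(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let diff: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
    diff as f64 / (a.len() * 64) as f64
}

/// Cosine distance, 1 - cos(a, b), in [0, 2]. Returns NaN if either vector is all zeros.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}
```

The two metrics live on different scales, which is acceptable here because the RSA step compares RDMs by correlation and is therefore insensitive to a positive rescaling or offset of either one.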

Step 3 — RSA comparison.

For each (movie, cortical network) pair:

RSA_score = pearson(upper_triangle(RDM_brain), upper_triangle(RDM_agent))

Higher = agent representations are more similar to predicted cortical representations.
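The RSA comparison for one (movie, network) pair can be sketched as below, assuming both RDMs are square matrices over the same TRs. `upper_triangle`, `rsa_score`, and `pearson` are hypothetical helper names.

```rust
/// Pearson correlation; returns NaN if either input is constant.
pub fn pearson(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    sxy / (sxx * syy).sqrt()
}

/// Flattens the strict upper triangle (i < j) of a square matrix, row by row.
/// The diagonal and lower triangle are skipped because an RDM is symmetric.
pub fn upper_triangle(rdm: &[Vec<f64>]) -> Vec<f64> {
    let n = rdm.len();
    let mut out = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            out.push(rdm[i][j]);
        }
    }
    out
}

/// RSA score for one (movie, network) pair.
pub fn rsa_score(rdm_brain: &[Vec<f64>], rdm_agent: &[Vec<f64>]) -> f64 {
    assert_eq!(rdm_brain.len(), rdm_agent.len());
    pearson(&upper_triangle(rdm_brain), &upper_triangle(rdm_agent))
}
```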

Step 4 — Aggregate.

BAMS_score(system) = mean over (movies, networks) of RSA_score
BAMS_per_network(system, network) = mean over movies of RSA_score for that network

Both reported in publication. Per-network breakdown is more diagnostic than the aggregate.
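The two aggregates can be sketched over a flat list of per-pair scores. `RsaResult` and both function names are hypothetical; the actual `agidb-bams` types are not specified here.

```rust
use std::collections::BTreeMap;

/// One RSA score for a (movie, network) pair.
pub struct RsaResult {
    pub movie: String,
    pub network: String,
    pub rsa: f64,
}

/// BAMS_score: mean RSA over all (movie, network) pairs. `None` for an empty list.
pub fn bams_score(results: &[RsaResult]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    Some(results.iter().map(|r| r.rsa).sum::<f64>() / results.len() as f64)
}

/// BAMS_per_network: mean RSA over movies, keyed by network name.
pub fn bams_per_network(results: &[RsaResult]) -> BTreeMap<String, f64> {
    let mut acc: BTreeMap<String, (f64, usize)> = BTreeMap::new();
    for r in results {
        let entry = acc.entry(r.network.clone()).or_insert((0.0, 0));
        entry.0 += r.rsa;
        entry.1 += 1;
    }
    acc.into_iter().map(|(k, (sum, n))| (k, sum / n as f64)).collect()
}
```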

Reproducibility requirements

Stimulus dataset must be accessible (Courtois NeuroMod is open).
TRIBE v2 inference reproducible via published weights (CC BY-NC).
Random seeds documented for the agent under test where applicable.
Inference logs published.
Docker container with full pipeline released alongside the paper.

Baselines

BAMS evaluation must include these baselines for the paper to be credible:

Tier A — Necessary baselines

Baseline	Why	Expected score
Raw V-JEPA 2 latents	Establishes the encoder’s own brain-alignment without memory machinery. Lower bound for “pure perception”.	Mid range. V-JEPA 2 is part of TRIBE’s encoder stack, so some alignment is expected; but raw latents aren’t filtered/consolidated.
Raw Wav2Vec-BERT latents	Same for audio.	Mid for auditory network, low for others.
Raw Llama-3.2-3B latents	Same for text.	Mid for language network, low for others.
Random representations	Statistical null. Score should be ~0.	~0 (sanity check).
agidb v2.1	The system under test.	TBD; hypothesis: wins associative-cortex networks (DMN, dorsal attention, frontoparietal) due to HDC binding’s compositional structure.

Tier B — Competitor baselines

Baseline	Architecture	What we test
mem0	LLM-extracted facts + vector DB	Does extractive memory align with cortex? Hypothesis: low for sensory networks, moderate for language.
letta	OS-inspired memory tiers + LLM-managed	Does agent-managed memory align? Hypothesis: similar to mem0.
zep/graphiti	Temporal knowledge graph	Does graph structure align? Hypothesis: low (graphs are structurally unlike cortex).
hippoRAG	PPR over LLM-extracted KG	Does hippocampally-inspired retrieval align? Hypothesis: moderate due to the explicit memory-systems framing.
hippoMM	Dentate gyrus + CA3 abstractions for audiovisual	Closest spirit-analog to agidb. Hypothesis: competitive.

Tier C — Ablation baselines (for the paper)

Ablation	Tests
agidb without VSA binding (flat concatenation)	Whether role-filler binding matters
agidb with attention fusion instead of XOR	Whether factorability matters for alignment
agidb without brain-calibrated surprise (default 0.4 threshold)	Whether calibration matters
agidb with learned projection instead of random	Whether training the projection matters
agidb without consolidation	Whether sleep-like consolidation aligns with off-stimulus DMN

The ablations are what make this a paper rather than a benchmark report.

Implementation plan

Crate: `agidb-bams`

New workspace crate. Pure Rust implementation. Calls out to TRIBE v2 via subprocess (Python) in v2.1; native Rust port deferred.

Modules:

protocol.rs — the full BAMS protocol implementation
tribe.rs — TRIBE v2 inference wrapper via PyO3 subprocess
rsa.rs — representational similarity analysis (Kriegeskorte 2008)
networks.rs — six functional cortical network definitions, Schaefer-to-network mapping
baselines/mem0.rs — adapter to mem0 Python SDK
baselines/letta.rs — adapter to Letta API
baselines/zep.rs — adapter to Zep/Graphiti
baselines/hipporag.rs — adapter to HippoRAG (Python via subprocess)
baselines/random.rs — random representation baseline
cli.rs — agidb-bams CLI for running the full suite

CLI:

agidb-bams run \
    --systems agidb,mem0,letta,zep,hipporag,raw-vjepa,random \
    --movies algonauts-2025-ood \
    --networks all \
    --output bams-results-2026-08.json

agidb-bams report bams-results-2026-08.json \
    --format html \
    --output bams-report.html

Output schema:

{
  "version": "0.1.0",
  "tribe_version": "v2-march-2026",
  "agidb_version": "0.1.0-alpha",
  "timestamp": "2026-08-15T...",
  "results": {
    "agidb": {
      "overall_bams_score": 0.XX,
      "per_network": {
        "DMN": 0.XX,
        "visual": 0.XX,
        "auditory": 0.XX,
        "language": 0.XX,
        "dorsal_attention": 0.XX,
        "frontoparietal": 0.XX
      },
      "per_movie": { "pulp_fiction": {...}, ... }
    },
    "mem0": {...},
    ...
  },
  "reproduction": {
    "container_hash": "sha256:...",
    "seed": 42,
    "tribe_weights_hash": "..."
  }
}

Dependencies for v2.1

TRIBE v2 weights (CC BY-NC, research use; benchmark code Apache-2.0 with note)
Courtois NeuroMod dataset access (open access, requires acknowledgment)
Algonauts 2025 OOD stimulus files (open access via algonauts.org)
PyO3 + Python 3.11 + TRIBE inference deps (torch, transformers, etc.) for the v2.1 ship
Adapter packages for each baseline (mem0, letta-client, zep-python, hipporag)

Performance targets

Single-movie evaluation (all 6 networks): ≤ 30s on a laptop with GPU; ≤ 5min CPU-only
Full suite (6 movies × 7 systems × 6 networks): ≤ 8 hours on a single machine; parallelizable across movies
Single-movie RDM compute (one system): ≤ 5s
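For a sense of scale behind these targets, a rough sizing follows. It assumes each movie averages about one hour (the suite totals ~6 hours) and that RDMs are stored densely as 8-byte floats; both are assumptions, not measured figures.

```latex
\[
N_{\text{scores}} = 6 \ \text{movies} \times 7 \ \text{systems} \times 6 \ \text{networks} = 252
\]
\[
n \approx \frac{3600\ \text{s}}{1.49\ \text{s}} \approx 2416 \ \text{TRs per movie},
\qquad
\frac{n(n-1)}{2} \approx 2.9 \times 10^{6} \ \text{TR pairs per RDM}
\]
\[
n^{2} \times 8\ \text{bytes} \approx 47\ \text{MB per dense RDM}
\]
```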

The paper

Title

Brain-Aligned Memory Retrieval: Measuring Cognitive Plausibility in Agent Memory Systems via TRIBE-Derived Ground Truth

Authors (proposed)

Rohan [Lastname], Independent. Coauthors TBD as collaborations form.

Venue priority

ICLR 2026 MemAgents workshop (target). Reasons: explicit scope match, deadline alignment with month 12 ship, light review cycle, established community for agent memory.
CCN 2026 (Cognitive Computational Neuroscience). Backup if MemAgents misses deadline. Reasons: explicit brain+model interface community, oral presentation prestigious, but harder to slip a substrate-engineering paper through CCN reviewers expecting pure neuroscience.
MLSys 2027. Backup. Reasons: systems-paper-friendly, would emphasize the substrate engineering side. Timeline slips to 2027.
NeurIPS 2026 main. Stretch goal. Hard to land an agent-memory-systems paper here, but BAMS as a benchmark contribution could fit if framed right.

Abstract (target ~250 words)

Agent memory systems are typically evaluated on downstream QA benchmarks (LongMemEval, LoCoMo, BEAM, PrefEval) that score retrieval accuracy without reference to how human memory organizes the same information. We propose BAMS, a brain-aligned memory similarity benchmark scoring agent memory representations against ground-truth cortical activation patterns predicted by TRIBE v2 (Meta FAIR 2026), a foundation model predicting fMRI BOLD across 720 subjects watching naturalistic movies. Given a held-out audiovisual stimulus, we compare an agent’s internal memory representation trajectory against TRIBE v2’s predicted activation across six functional cortical networks (default mode, visual, auditory, language, dorsal attention, frontoparietal) using representational similarity analysis.

We apply BAMS to (a) raw V-JEPA 2 + Wav2Vec-BERT + Llama-3.2-3B latents, (b) agidb, an open-source rust-native HDC cognitive substrate projecting multimodal latents into 8192-bit binary signatures with VSA role-filler binding, (c) four existing agent-memory systems (mem0, letta, zep/graphiti, hippoRAG). We show: (i) HDC-binding-based memory representations are significantly more brain-aligned in associative cortex (DMN, dorsal attention, frontoparietal) than dense embedding retrieval, (ii) modality dropout during projection training mirroring TRIBE’s recipe improves alignment by X%, (iii) surprise-gated admission with thresholds calibrated to TRIBE-predicted neural surprise yields agent memory retaining the high-saliency moments human cortex retains.

We release agidb, BAMS, and a docker reproduction kit. Brain-alignment becomes a complementary evaluation axis for the agent memory community.

Structure (6 pages workshop version)

Introduction (1 page) — agent memory category, evaluation gap, brain-alignment as a new axis, contributions.
Background (1 page) — TRIBE v2 architecture, JEPA family, agent memory landscape (mem0/letta/zep), RSA methodology, HDC/VSA primer.
BAMS protocol (1.5 pages) — stimulus dataset, TRIBE inference, per-network RDM construction, agent RDM construction, RSA aggregation, reproducibility.
agidb-specific contribution (1 page) — multimodal HDC pipeline, VSA role-filler binding, brain-calibrated surprise gating, the three claims (i)-(iii) above.
Results (1 page) — overall BAMS scores table, per-network breakdown, ablations, qualitative analysis.
Discussion + limitations (0.5 page) — what BAMS doesn’t measure, TRIBE’s noise ceiling, future directions.

Full version (NeurIPS-style, 9 pages) adds: extended results, full ablation table, additional baselines, longer related work.
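The protocol steps named in the structure above (per-network RDM construction, agent RDM construction, RSA aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the agidb-bams implementation: it assumes agent signatures are compared by normalized Hamming distance, TRIBE-predicted activations by 1 − Pearson correlation, and the two RDMs by Spearman rank correlation. All function names are hypothetical, and signatures are modeled as plain Python integers.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs assumed)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def agent_rdm(signatures, n_bits):
    """Upper-triangle RDM of normalized Hamming distances between binary signatures."""
    return [bin(a ^ b).count("1") / n_bits for a, b in combinations(signatures, 2)]


def brain_rdm(activations):
    """Upper-triangle RDM of correlation distances between predicted activation vectors."""
    return [1.0 - pearson(a, b) for a, b in combinations(activations, 2)]


def bams_scores(signatures, n_bits, activations_by_network):
    """One RSA score per functional network (assumed aggregation: Spearman on RDMs)."""
    agent = agent_rdm(signatures, n_bits)
    return {
        network: spearman(agent, brain_rdm(acts))
        for network, acts in activations_by_network.items()
    }
```

Both RDMs are built over the same ordered stimulus list, so entry k of each vector refers to the same stimulus pair. Spearman is assumed here because Hamming and correlation distances live on different scales; a rank-based comparison avoids having to calibrate one against the other.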

What we don’t claim in the paper

We don’t claim agidb “thinks like a brain.”
We don’t claim BAMS replaces existing benchmarks. It’s complementary.
We don’t claim brain-alignment correlates with downstream agent task performance unconditionally. That’s a research question for a separate paper.
We don’t claim TRIBE v2’s predictions are perfect cortical ground truth. They are the best currently available, bounded by their own noise ceiling (~54% of explainable variance).

What success at v2.1 looks like (BAMS-specific)

BAMS suite open-sourced at github.com/agidb/agidb-bams: benchmark code under Apache-2.0, plus a research note flagging that the TRIBE v2 weights are CC BY-NC.
Docker container reproduces the reported numbers to within 1%.
agidb wins BAMS in at least 3 of 6 functional networks (target: DMN, dorsal attention, frontoparietal — the associative-cortex networks where HDC binding’s compositional structure should help).
Paper submitted to ICLR 2026 MemAgents workshop by deadline.
Paper accepted (workshop acceptance rate typically 50-60%).
Cited by at least one other agent-memory paper within 6 months of acceptance.
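Two of the criteria above are mechanical enough to check in code. The sketch below is one possible reading, not a specification: it assumes "wins" means a strictly higher per-network score than every other system, and that "within 1%" is a relative tolerance on each reported value. The function names and the dictionary layout are hypothetical.

```python
def networks_won(scores, target="agidb"):
    """Networks where `target` scores strictly higher than every other system.

    `scores` maps system name -> {network name -> BAMS score}.
    """
    won = []
    for network, target_score in scores[target].items():
        rivals = [s[network] for name, s in scores.items() if name != target]
        if rivals and target_score > max(rivals):
            won.append(network)
    return won


def reproduces(reported, rerun, rel_tol=0.01):
    """True if every rerun value is within `rel_tol` (relative) of the reported value."""
    return all(
        abs(rerun[key] - reported[key]) <= rel_tol * abs(reported[key])
        for key in reported
    )


def meets_network_target(scores, target="agidb", minimum=3):
    """Success criterion from the list above: at least `minimum` networks won."""
    return len(networks_won(scores, target)) >= minimum
```

If the intended tolerance is absolute (for example, 0.01 in correlation units) rather than relative, only the comparison in `reproduces` changes; stating which one is meant in the reproduction kit would avoid ambiguity.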

What failure modes look like

BAMS shows no meaningful difference between systems. Possible if RSA scores all cluster around the noise floor; would suggest BAMS isn’t discriminative. Mitigation: add finer-grained per-stimulus-class analysis, include more diverse ablations.
agidb loses to raw V-JEPA latents. Possible if HDC projection loses too much information vs the dense baseline. Indicates either projection bottleneck (revisit Charikar 2002 → learned quantization in v2.2) or the substrate adds noise without compositional benefit.
MemAgents workshop deadline missed. Backup CCN 2026 has a later deadline. Worst case: defer to NeurIPS 2026 main track.
TRIBE v2 doesn’t generalize as well as the v1 paper suggests. Mitigation: report results bounded by TRIBE’s own Pearson scores; acknowledge the ceiling.
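Two of the failure modes above have standard statistical checks. The sketch below is illustrative only and makes several assumptions: a stimulus-label permutation test stands in for "is the score distinguishable from the noise floor", Pearson correlation on RDM upper triangles is used for brevity (the actual protocol may use a rank correlation), and dividing by a ceiling is one common way to report a score "bounded by" the reference model's own accuracy. All names are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs assumed)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def upper(rdm, order):
    """Upper triangle of a square RDM after relabeling stimuli by `order`."""
    return [rdm[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]


def permutation_p_value(agent_rdm, brain_rdm, n_perm=1000, seed=0):
    """Observed RSA score and a one-sided p-value from shuffling stimulus labels.

    Rows and columns of the agent RDM are permuted together, which keeps the
    RDM a valid distance matrix while breaking its alignment with the brain RDM.
    """
    rng = random.Random(seed)
    identity = list(range(len(agent_rdm)))
    brain_vec = upper(brain_rdm, identity)
    observed = pearson(upper(agent_rdm, identity), brain_vec)
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)
        if pearson(upper(agent_rdm, order), brain_vec) >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself as one permutation.
    return observed, (hits + 1) / (n_perm + 1)


def ceiling_normalized(score, ceiling):
    """Score as a fraction of the reference model's ceiling (assumed normalization)."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return score / ceiling
```

If p-values for all systems sit well above a conventional threshold, that is evidence for the "BAMS isn't discriminative" failure mode rather than for any particular system being poorly aligned.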

Open questions for future BAMS iterations

Can BAMS be extended to non-naturalistic stimuli? Requires fMRI ground truth for the target stimulus class. Currently movies + podcasts + silent videos.
Can BAMS be evaluated against MEG/EEG instead of fMRI? Higher temporal resolution but lower spatial resolution. Different ground-truth models needed (Brain-JEPA, signal-JEPA, Laya).
Can a leaderboard format work? Probably yes; the BAMS protocol is deterministic enough. Risk: gaming the benchmark (overfit to TRIBE’s predictions rather than to actual cognitive structure).
Should BAMS scores be released per cortical parcel rather than per network? Higher resolution but more variance. v2.2+ decision.
Can BAMS evaluate working memory and goal-directed retrieval, not just episodic encoding? Yes, requires goal-conditioned stimulus design. v2.2+ extension.

Why this matters in one line

BAMS gives agidb a published evaluation that none of mem0, letta, zep, or cognee can run on themselves without rebuilding their architecture: they don’t share TRIBE v2’s encoder stack, and they lack the factorable representation that lets per-modality alignment be measured. Brain-alignment is agidb’s structural moat.
