Most brain training app reviews rely on a quick week of play plus whatever the company's marketing page claims. We spend three weeks on each app, track four categories of metrics (engagement, in-app score trajectory, subjective cognitive experience, and external task performance), and cross-check advertised benefits against independent peer-reviewed research. This article documents the exact protocol so you can replicate it on any app we haven't covered yet. The short version: if a reviewer hasn't played an app for at least 21 days and hasn't run before/after external tasks, their conclusions about whether it “works” are guesses.
Brain training reviews online fall into two categories. Either the reviewer played the app for a weekend and wrote 1,500 words of vibes, or they parroted the app's own marketing claims and added an affiliate link. Neither approach tells you whether the app actually does what it says on the store page. When we wrote our Lumosity, Peak, Elevate, and CogniFit comparison, we wanted something more defensible — which meant designing a protocol first and following it through, even when the results were inconvenient for the company being reviewed. This article documents that protocol.
Why Publish Our Methodology?
Three reasons, all practical:
- Reproducibility. If another person follows our protocol on the same app, they should arrive at broadly similar conclusions. If they don't, we want to know why. Either our methodology has a gap or we missed something.
- Honesty about limits. Our testing isn't a clinical trial. Calling out what we can't measure matters more than claiming we measure everything.
- A template for readers. If you're considering an app we haven't reviewed yet, our protocol is something you can run yourself in three weeks. Readers email us asking about apps we've never touched; this is our answer to those requests.
The 3-Week Testing Protocol
Three weeks is the minimum window where two things can be meaningfully measured: whether the app retains your attention past initial novelty, and whether you see any cognitive transfer beyond just getting better at the app's own games. Shorter windows mostly measure the learning curve of the games themselves.
Week 1: Baseline + Onboarding
Day 1 starts with an external baseline, before touching the app. We run three short cognitive tasks that don't overlap with typical brain training games:
- Digit Span from the open-source PsyToolkit battery. Tests working memory capacity on a task no commercial app uses.
- N-Back (2-back condition). Tests working memory updating under load.
- Stroop task. Tests selective attention and inhibition. We use the PsyToolkit version for consistency.
Each task runs three times with 5-minute breaks between runs to get a stable baseline that accounts for warm-up effects. Then the app gets installed and we go through whatever onboarding flow it offers: assessment quizzes, personalized plans, whatever the defaults are.
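For anyone replicating this at home, here's a minimal sketch of how the baseline could be logged and averaged. The task keys, score units, and numbers are placeholders for your own results, and a spreadsheet works just as well:

```python
# Sketch: log each baseline run and average per task.
# Task names, units, and numbers are placeholders for your own scores.
from statistics import mean

baseline_runs = {
    "digit_span": [6, 7, 7],            # longest span recalled, three runs
    "n_back_2": [0.70, 0.75, 0.78],     # 2-back hit rate, three runs
    "stroop_ms": [812, 790, 785],       # mean reaction time (ms), three runs
}

baseline = {task: mean(runs) for task, runs in baseline_runs.items()}
for task, score in baseline.items():
    print(f"{task}: baseline = {score:.2f}")
```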
Week 2: Daily Consistent Practice
15-20 minutes per day, every day, at roughly the same time (usually morning with coffee, which approximates how most users fit these apps into a routine). We complete whatever session the app recommends and avoid skipping games we find boring. Part of the review is judging whether the app's default recommendations are good, not whether we personally like specific games.
During this week we also track qualitative notes: which games felt engaging, which felt like time-fillers, where the difficulty ramp started to feel genuinely challenging, and what the app was asking us to do versus what the marketing page claimed.
Week 3: Harder Routine + Post-Test
We increase session length to 25 minutes per day and deliberately play games the app doesn't recommend by default, usually the ones targeting weaknesses the onboarding assessment identified. This tests whether the app's wider content library holds up, not just the auto-recommended loop.
On day 21, we repeat the baseline tasks (Digit Span, N-Back, Stroop) under the same conditions. If there's a meaningful gap between pre- and post-test scores, we flag it. We're not running a statistical test; that would require a control group and far more participants. But even a single-participant pre/post comparison tells you whether practice effects are large enough to notice, and small, subtle effects usually mean little real-world value.
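A rough sketch of that pre/post comparison, assuming you logged baseline and day-21 averages the same way as above. The 10% flag threshold is an arbitrary illustration, not a statistical criterion:

```python
# Sketch: descriptive pre/post comparison, not a statistical test.
# The 10% threshold is an illustrative rule of thumb, nothing more.
pre = {"digit_span": 6.7, "n_back_2": 0.74, "stroop_ms": 796}
post = {"digit_span": 7.0, "n_back_2": 0.80, "stroop_ms": 788}

lower_is_better = {"stroop_ms"}  # faster Stroop responses count as improvement

for task in pre:
    change = (post[task] - pre[task]) / pre[task] * 100
    if task in lower_is_better:
        change = -change
    verdict = "worth flagging" if abs(change) >= 10 else "within noise"
    print(f"{task}: {change:+.1f}% ({verdict})")
```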
Four Categories of Metrics
1. Engagement (Retention Signals)
Did we want to open the app? We count voluntary opens that aren't part of the protocol, session length beyond the required minimum, and whether we completed optional daily challenges. Apps that retain us past week 1 without external enforcement score higher on engagement. Apps where we had to force ourselves to hit day 21 have a design problem: even if the underlying cognitive tasks are sound, users won't practice long enough to benefit.
2. In-App Score Trajectory
Every app tracks your scores on its own games. We record the opening score and the week-3 score for each game the app prioritizes. Improvements of roughly 20% or more within the app are typical for most games. This is mostly learning the task mechanics. Apps where scores plateau quickly (a week or two) suggest the difficulty ceiling is too low. Apps where scores keep climbing at week 3 suggest the content library has more depth.
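The arithmetic behind the trajectory is simple. A sketch with invented game names and scores:

```python
# Sketch: opening score vs week-3 score on the app's own games.
# Game names and numbers are invented for illustration.
in_app_scores = {
    "pattern_memory": (420, 560),
    "speed_sort": (38, 44),
    "word_recall": (1200, 1450),
}

for game, (opening, week3) in in_app_scores.items():
    pct = (week3 - opening) / opening * 100
    print(f"{game}: {pct:+.0f}% vs opening score")
```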
3. Subjective Cognitive Experience
A short daily log: rate perceived mental sharpness (1-5) before and after each session, and note any spillover, meaning moments during the day when something felt easier or harder than usual that might relate to the training. This is the weakest category methodologically because subjective reports are noisy and vulnerable to placebo effects. We include it because complete dismissal of subjective experience is also wrong, and patterns across three weeks can be informative even if any single day isn't.
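If you prefer the log in a structured form rather than a notebook page, a minimal sketch; the field names are our own invention:

```python
# Sketch: one structured entry per day; field names are our own invention.
from dataclasses import dataclass

@dataclass
class DailyLog:
    day: int
    sharpness_before: int   # 1-5 rating before the session
    sharpness_after: int    # 1-5 rating after the session
    spillover_notes: str = ""

entry = DailyLog(day=8, sharpness_before=3, sharpness_after=4,
                 spillover_notes="recalled names more easily in a meeting")
print(entry)
```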
4. External Task Performance (Transfer)
This is the category that separates honest reviews from hype. Pre/post scores on Digit Span, N-Back, and Stroop. If an app claims to improve working memory and Digit Span scores barely move, that claim doesn't hold. If an app claims to improve attention and Stroop scores stay flat, same story. Far transfer (improvement on tasks quite different from the training games) is what matters for whether the app delivers real-world cognitive benefit. Near transfer (getting better at very similar tasks) is easy and usually happens. Far transfer is the rare and interesting finding.
How We Validate Advertised Claims
Every brain training app makes marketing claims on its store page and homepage. We screenshot the specific claims at the start of testing and check each one against our data. Three sources feed the validation:
- Our own testing data — the four metric categories above.
- Cited peer-reviewed research: whatever the company lists on its “science” page. We read each paper and check whether it actually supports the marketing claim, or whether the study tested something narrower that was then over-generalized.
- Independent meta-analyses, particularly Simons et al. (2016) in Psychological Science in the Public Interest, which remains the most thorough independent review of the commercial brain training industry. Later papers that cite it or update its conclusions are also checked.
Claims that survive all three sources go in as verified. Claims that rely on company-funded studies using the company's own assessments as outcome measures are flagged as weak. Claims that contradict independent research are called out directly. This is how the Lumosity comparison ended up noting the FTC settlement. The marketing claims existed; the research didn't support them; we had to say so.
What This Methodology Can't Do
Being honest about limits is part of the methodology. Here's what our protocol can't tell you:
- Long-term effects. Three weeks doesn't tell you whether benefits persist after you stop using the app. The research on maintenance is thin and we don't try to extend our protocol there.
- Clinical populations. All our testing is on neurotypical adults without diagnosed cognitive conditions. Apps like CogniFit that target clinical use (ADHD, post-concussion, MCI) may produce different results in those populations. Our review can only speak to what we tested.
- Individual variation. One person's three weeks is a case study, not a trial. If we say Peak felt engaging and challenging, that's our experience. Your tolerance for specific game mechanics will vary.
- Placebo-controlled rigor. We can't blind ourselves to which app we're using, and expectations matter. The pre/post task scores help control for this a little, but not fully.
If these limits matter for your decision (you have a specific cognitive concern, you're considering long-term use, or you want trial-grade evidence), the honest answer is to consult published research and talk to a clinician, not read a product review.
Apply This Protocol Yourself
If you're considering an app we haven't reviewed, here's the condensed 21-day version (a day-by-day sketch follows the list):
- Before installing: run Digit Span and Stroop from PsyToolkit three times each. Write down the scores.
- Screenshot the marketing page so you can check specific claims later.
- Days 1-7: follow default onboarding, 15 minutes daily, same time of day. Note which games you look forward to and which feel like a chore.
- Days 8-14: 20 minutes daily. Try the full content library, not just recommended sessions.
- Days 15-21: 25 minutes daily, focus on games targeting your weakest area per the app's assessment.
- Day 22: re-run Digit Span and Stroop. Compare to day 1.
- Compare claims to outcomes. Did the app do what its store page said?
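The same checklist expressed as a day-by-day plan, for anyone who wants something printable. Day 0 stands in for the pre-install baseline; everything else mirrors the list above:

```python
# Sketch: the checklist above as a day-by-day plan.
# Day 0 stands in for the pre-install baseline; day 22 is the post-test.
def plan_day(day: int) -> str:
    if day == 0:
        return "Baseline: Digit Span + Stroop (3 runs each); screenshot the marketing page"
    if 1 <= day <= 7:
        return "15 min: default sessions; note what feels engaging vs a chore"
    if 8 <= day <= 14:
        return "20 min: explore the full content library, not just recommended sessions"
    if 15 <= day <= 21:
        return "25 min: target your weakest area per the app's assessment"
    if day == 22:
        return "Post-test: re-run Digit Span + Stroop; compare to baseline"
    return "Wrap-up: compare outcomes against the screenshotted claims"

for d in range(24):
    print(f"Day {d:>2}: {plan_day(d)}")
```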
If results are ambiguous (small, inconsistent gains that could be practice effects on the PsyToolkit tasks themselves), the honest interpretation is “inconclusive”, not “it works.” Cognitive transfer is difficult to demonstrate even in formal studies. Your at-home protocol won't resolve that.
Frequently Asked Questions
Why three weeks and not longer?
Three weeks is long enough to see past the initial novelty spike and get a real sense of the app's content depth and retention qualities. It's also short enough to run across multiple apps in a reasonable time and let us publish comparisons readers actually find useful. Formal trials often run 4-8 weeks, but those treat the training as a dedicated commitment for participants; our protocol mimics how real people fit apps into normal lives. Extending to 8 weeks tends to produce the same conclusions we reach at 21 days, just with more fatigue in our notes.
Are PsyToolkit tasks a fair baseline?
PsyToolkit hosts open-source versions of classic cognitive psychology tasks used in academic research. They're not perfect; no cognitive task is. But they have two advantages for our purposes: they don't overlap with the games any commercial brain training app uses, and they're free, so anyone reading this review can replicate the baseline themselves. If an app improves performance on its own post-test, that tells us nothing because the app knows the test. If performance improves on PsyToolkit tasks it has never seen, that's at least plausible evidence of transfer.
What if I don't have time for 21 days?
The shortened version: play for 7 days, then take an honest inventory. If you dreaded opening the app by day 4, the content isn't good enough to build a habit around, no matter what the scientific claims are. If the games felt varied and challenging by day 7, the app probably has enough depth for a longer commitment. This won't tell you about cognitive transfer, but it will tell you whether the app is worth investing 21 days in. Most apps fail the 7-day test, which is a useful filter before the full protocol.
Can I trust any single review of a brain training app?
Single reviews are limited by definition: one reviewer's experience with three weeks of one app. What you actually want is the combination of methodology (can they describe what they did day by day?), honesty about limits (do they admit what they can't conclude?), and consistency with independent research (do their findings align with or diverge from peer-reviewed meta-analyses?). A review that claims dramatic cognitive improvements after a week and doesn't mention the Simons et al. 2016 paper should be treated with suspicion. A review that carefully distinguishes between in-app performance gains and real-world transfer deserves more weight.