Technical Documentation

This section contains technical documentation for developers and maintainers of My Health Portal.

System Architecture

The platform is built as a containerized Python application with a documentation portal:

```
┌────────────────────────────────────────────────────┐
│                  Docker Container                  │
│  ┌──────────────────────────────────────────────┐  │
│  │         MkDocs Documentation Portal          │  │
│  │                 (Port 8001)                  │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │         Python 3.11 Processing Layer         │  │
│  │  • Parsers (PDF extraction)                  │  │
│  │  • Analyzers (statistical analysis)          │  │
│  │  • Exporters (CSV/JSON generation)           │  │
│  │  • Visualizers (charts & graphs)             │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │              Data Storage Layer              │  │
│  │  • data/raw/ (sensitive, git-ignored)        │  │
│  │  • data/processed/ (structured output)       │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
```

Core Components

Blood Test Parser

Location: src/parsers/blood_test_parser.py

Purpose: Extract structured data from PathCare laboratory PDF reports.

Key Features:

  • Multi-department section detection (Biochemistry, Haematology, Endocrinology, Serology, PCR)
  • Flexible regex-based test line parsing
  • Metadata extraction (dates, requisition numbers, doctors, labs)
  • Reference range parsing with min/max values
  • Flag detection (High/Low/Critical markers)
  • Delta value tracking
  • Exclusion list system for problematic files

Usage:

```bash
# Process first 5 PDFs (testing)
python src/parsers/blood_test_parser.py --limit 5

# Process all PDFs
python src/parsers/blood_test_parser.py --all
```

Output:

  • Individual JSON files per test session
  • Extraction summary with statistics
  • Detailed logging (console + file)

Data Consolidator

Location: src/exporters/consolidate_data.py

Purpose: Combine individual test JSONs into master datasets.

Outputs:

  • blood_tests.csv - Master dataset with all test results
  • metadata.csv - Test session metadata
  • biomarker_catalog.json - Frequency analysis of biomarkers

Usage:

```bash
python src/exporters/consolidate_data.py
```

Data Schemas

Individual Test JSON Schema

```json
{
  "metadata": {
    "source_file": "filename.pdf",
    "collection_date": "2025-11-14",
    "collection_time": "14:20",
    "received_date": "2025-11-14",
    "received_time": "15:45",
    "requisition_number": "654391870",
    "patient_name": "DALING JAN",
    "patient_id": "8305145088089",
    "ordering_doctor": "DR J C H LAPORTA",
    "laboratory": "Christiaan Barnard Memorial Hospital"
  },
  "tests": [
    {
      "test_name": "HAEMOGLOBIN",
      "value": 14.2,
      "unit": "g/dL",
      "reference_range": "13.0 - 18.0",
      "reference_min": 13.0,
      "reference_max": 18.0,
      "flag": null,
      "department": "Ha HAEMATOLOGY",
      "delta_value": null,
      "delta_date": null
    }
  ]
}
```
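
For a quick sanity check, a session file can be loaded directly. A minimal sketch that lists any flagged results; the filename here is hypothetical (actual names follow the `YYYY-MM-DD_*.json` pattern), and field names follow the schema above:

```python
import json
from pathlib import Path

# Hypothetical session file; real names follow YYYY-MM-DD_*.json
path = Path('data/processed/individual_tests/2025-11-14_654391870.json')
session = json.loads(path.read_text())

print(session['metadata']['collection_date'], session['metadata']['laboratory'])
for test in session['tests']:
    if test['flag']:  # only H/L/* results
        print(test['test_name'], test['value'], test['unit'], test['flag'])
```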

Master CSV Schema

| Column | Type | Description |
|--------|------|-------------|
| collection_date | date | When the sample was collected |
| collection_time | time | Collection time |
| test_name | string | Name of the test (normalized uppercase) |
| value | float | Numeric test result |
| unit | string | Unit of measurement (g/dL, IU/L, etc.) |
| reference_range | string | Normal range as text |
| reference_min | float | Lower bound of the normal range |
| reference_max | float | Upper bound of the normal range |
| flag | string | H (High), L (Low), * (Critical), or null |
| department | string | Lab department code + name |
| delta_value | float | Change from the previous test |
| delta_date | date | Date of the previous test |
| requisition_number | string | Lab requisition ID |
| ordering_doctor | string | Requesting physician |
| laboratory | string | Lab facility name |
| source_file | string | Original PDF filename |
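
For analysis it helps to parse the date columns on load. A minimal sketch using the column names above (keeping requisition numbers as strings is an optional precaution, not a requirement):

```python
import pandas as pd

# Parse date columns up front; keep requisition numbers as strings
# so any leading zeros survive the round trip.
df = pd.read_csv(
    'data/processed/blood_tests.csv',
    parse_dates=['collection_date', 'delta_date'],
    dtype={'requisition_number': 'string'},
)
print(df.dtypes)
```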

Biomarker Catalog Schema

```json
{
  "HAEMOGLOBIN": {
    "count": 57,
    "first_seen": "2025-03-21",
    "last_seen": "2025-11-14",
    "units": ["g/dL"],
    "departments": ["Ha HAEMATOLOGY"]
  }
}
```
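
The catalog entries can be rebuilt from the master CSV. One plausible derivation (the consolidator's own code may differ):

```python
import json
import pandas as pd

df = pd.read_csv('data/processed/blood_tests.csv')

# ISO date strings sort chronologically, so min/max give first/last seen.
catalog = {
    name: {
        'count': int(len(group)),
        'first_seen': group['collection_date'].min(),
        'last_seen': group['collection_date'].max(),
        'units': sorted(group['unit'].dropna().unique()),
        'departments': sorted(group['department'].dropna().unique()),
    }
    for name, group in df.groupby('test_name')
}
print(json.dumps(catalog['HAEMOGLOBIN'], indent=2))
```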

Parser Implementation Details

Section Detection

The parser identifies department sections using regex patterns:

```python
section_patterns = [
    r'(Ch\s+[A-Z]+)',      # Ch BIOCHEMISTRY
    r'(Ha\s+[A-Z]+)',      # Ha HAEMATOLOGY
    r'(Pc\s+[A-Z\.\s]+)',  # Pc P.C.R. DEPARTMENT
    r'(Se\s+[A-Z]+)',      # Se SEROLOGY
    r'(En\s+[A-Z]+)',      # En ENDOCRINOLOGY
]
```

Test Line Parsing

The main regex pattern for extracting test results:

```python
pattern = r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+([\d.<>]+\s*[-–]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?\s*([a-zA-Z0-9/\-%]+)?'
```

Captures:

  1. Test name (letters, spaces, hyphens, parentheses, slashes, percentages)
  2. Numeric value (digits, decimal points, comparison operators)
  3. Flag character (H, L, *, #)
  4. Reference range (various formats)
  5. Unit of measurement
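
Applied to typical result lines (invented here for illustration), the groups map out as follows:

```python
import re

pattern = r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+([\d.<>]+\s*[-–]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?\s*([a-zA-Z0-9/\-%]+)?'

# Sample lines in the PathCare layout (invented for illustration)
for line in ['HAEMOGLOBIN 14.2 13.0 - 18.0 g/dL',
             'POTASSIUM 5.6 H 3.5 - 5.1 mmol/L']:
    match = re.match(pattern, line)
    if match:
        name, value, flag, ref_range, unit = match.groups()
        print(name.strip(), value, flag, ref_range, unit)
# HAEMOGLOBIN 14.2 None 13.0 - 18.0 g/dL
# POTASSIUM 5.6 H 3.5 - 5.1 mmol/L
```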

Skip Patterns

Lines to ignore during parsing:

```python
skip_patterns = [
    r'^\s*$',                    # Empty lines
    r'^[A-Z][a-z]',              # Section headers (e.g., "Ha HAEMATOLOGY")
    r'^[A-Z\s]+$',               # ALL CAPS LINES
    r'RED CELLS|WHITE CELLS',    # Subsection headers
]
```
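
A line is skipped when any pattern matches. A minimal helper sketch; the parser's actual control flow may differ:

```python
import re

skip_patterns = [
    r'^\s*$',                    # Empty lines
    r'^[A-Z][a-z]',              # Section headers (e.g., "Ha HAEMATOLOGY")
    r'^[A-Z\s]+$',               # ALL CAPS LINES
    r'RED CELLS|WHITE CELLS',    # Subsection headers
]

def should_skip(line: str) -> bool:
    """True if the line matches any skip pattern."""
    return any(re.search(p, line) for p in skip_patterns)

print(should_skip('RED CELLS'))              # True (subsection header)
print(should_skip('HAEMOGLOBIN 14.2 g/dL'))  # False (a real result line)
```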

Quality Metrics

Current Performance (v0.1.0)

  • Success Rate: 100% (77/77 PDFs processed)
  • Test Extraction: 1,845 tests across 71 sessions
  • Biomarkers Tracked: 131 unique markers
  • Date Coverage: 8 months (March - November 2025)
  • Excluded Files: 8 (documented in .exclude)

Validation Approach

  1. Automated Extraction: Parser processes all PDFs
  2. Summary Statistics: Test counts, biomarker frequencies
  3. Spot Checking: Manual verification of critical biomarkers
  4. Trend Analysis: Query specific markers to verify data integrity
  5. Flag Validation: Confirm HIGH/LOW markers match reference ranges
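
Step 5, for example, can be automated with pandas. A sketch assuming the Master CSV schema above:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/processed/blood_tests.csv')

# Only rows with a fully parsed numeric range can be checked.
chk = df.dropna(subset=['value', 'reference_min', 'reference_max']).copy()

# Recompute the expected flag from the reference bounds.
chk['expected_flag'] = np.select(
    [chk['value'] > chk['reference_max'], chk['value'] < chk['reference_min']],
    ['H', 'L'],
    default='',
)

# Compare against the lab's H/L flags; '*' criticals will surface
# here for manual review rather than being silently accepted.
actual = chk['flag'].where(chk['flag'].isin(['H', 'L'])).fillna('')
mismatches = chk[actual != chk['expected_flag']]
print(f'{len(mismatches)} rows where flag and reference range disagree')
```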

Exclusion List

Files in data/raw/blood-tests/.exclude are skipped during parsing:

```text
# COVID test (different format)
2021-03-27.09_17.Daling, Jan-Marten-672099237-.pdf

# Corrupted PDF
2025-04-09.DALING_JANMARTEN-664471957(HA)-250409122955288.pdf

# Format variations requiring special handling
2025-04-24.2.Daling, Jan-Marten-664477954-.pdf
2025-07-11.12_39.Daling, Jan-Marten-664013903-.pdf
# ... (6 files with 0 tests extracted)
```
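
One plausible way to honour the exclusion list when collecting PDFs, assuming one filename per line with `#` comments (not necessarily the parser's implementation):

```python
from pathlib import Path

def load_exclusions(raw_dir: Path) -> set[str]:
    """Filenames listed in .exclude, ignoring blanks and # comments."""
    exclude_file = raw_dir / '.exclude'
    if not exclude_file.exists():
        return set()
    return {
        line.strip()
        for line in exclude_file.read_text().splitlines()
        if line.strip() and not line.lstrip().startswith('#')
    }

raw = Path('data/raw/blood-tests')
excluded = load_exclusions(raw)
pdfs = [p for p in sorted(raw.glob('*.pdf')) if p.name not in excluded]
```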

Development Workflow

Adding New PDFs

  1. Place PDFs in data/raw/blood-tests/
  2. Run parser: docker exec remission python src/parsers/blood_test_parser.py --all
  3. Consolidate: docker exec remission python src/exporters/consolidate_data.py
  4. Review extraction summary for any issues

Resetting Data

To clear processed data and re-import:

```bash
# Clean processed files
rm -f data/processed/*.csv
rm -f data/processed/*.json
rm -rf data/processed/individual_tests/*

# Re-run parser and consolidator
docker exec remission python src/parsers/blood_test_parser.py --all
docker exec remission python src/exporters/consolidate_data.py
```

Querying Data

Use pandas to query the master CSV:

```python
import pandas as pd

# Load data
df = pd.read_csv('data/processed/blood_tests.csv')

# Query specific biomarker
crp = df[df['test_name'] == 'C-REACTIVE PROTEIN']
print(crp[['collection_date', 'value', 'unit', 'flag']])

# Filter by date range
recent = df[df['collection_date'] >= '2025-10-01']

# Flag analysis
high_flags = df[df['flag'] == 'H']
```

Technology Stack

Core Dependencies

  • Python 3.11 - Runtime environment
  • pdfplumber - PDF text extraction
  • pandas - Data manipulation and analysis
  • numpy - Numerical operations
  • MkDocs - Documentation generation
  • Material for MkDocs - Documentation theme

Development Tools

  • Docker & Docker Compose - Containerization
  • Git - Version control
  • logging - Application logging
  • json - Data serialization
  • pathlib - File path handling

Security & Privacy

Data Protection

  • All data/ directories are git-ignored
  • Patient information never committed to version control
  • PDFs contain sensitive health information (PHI)
  • Container isolates processing environment

Access Control

  • Repository is private (JMDaling/remission)
  • Docker container runs as non-root user
  • MkDocs documentation does not expose raw data
  • Processed CSVs stored locally only

Future Enhancements

Planned Features

  1. Visualization Module
     • Time-series charts for biomarkers
     • Correlation heatmaps
     • Reference range overlay
     • Treatment timeline integration

  2. Statistical Analysis
     • Trend detection algorithms
     • Outlier identification
     • Predictive modeling
     • Correlation analysis

  3. Treatment Tracking
     • Medication schedule
     • Dosage tracking
     • Side effect logging
     • Effectiveness correlation

  4. Alert System
     • Concerning trend detection
     • Reference range violations
     • Test frequency reminders
     • Doctor appointment triggers

  5. Multi-Lab Support
     • Additional PDF formats
     • Format detection
     • Unified data model
     • Lab comparison tools

Medical Timeline

Location: docs/health-data/timeline.md
Configuration: docs/config/medical-events.json

The medical timeline provides a visual, chronological view of key medical events with scaled date spacing.

Managing Timeline Events

Timeline events are managed in the JSON configuration file: docs/config/medical-events.json

Adding a New Event

Edit the JSON file and add a new event object to the events array:

```json
{
  "date": "2025-12-01",
  "title": "Event Title",
  "description": "Event description",
  "category": "test",
  "documents": [
    {
      "name": "Document Name",
      "path": "../path/to/document.pdf"
    }
  ],
  "notes": "Optional notes or key metrics"
}
```
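
Events can also be appended programmatically. A minimal helper sketch; the chronological sort is an assumption about how the timeline page reads the file:

```python
import json
from pathlib import Path

config = Path('docs/config/medical-events.json')
data = json.loads(config.read_text())

# Hypothetical new event following the schema above
data['events'].append({
    'date': '2025-12-01',
    'title': 'Event Title',
    'description': 'Event description',
    'category': 'test',
})

# Keep events chronological (assumed, not verified against the page)
data['events'].sort(key=lambda e: e['date'])
config.write_text(json.dumps(data, indent=2, ensure_ascii=False))
```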

Event Properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| date | string | Yes | ISO date format (YYYY-MM-DD) |
| title | string | Yes | Event title |
| description | string | Yes | Brief description |
| category | string | Yes | One of: diagnosis, treatment, test, appointment, procedure |
| documents | array | No | Array of document objects with name and path |
| notes | string | No | Additional notes or metrics |

Event Categories

  • 🔍 diagnosis - Initial diagnosis and major findings
  • 💊 treatment - Chemotherapy, medication, therapies
  • 🧪 test - Blood tests, imaging, lab results
  • 👨‍⚕️ appointment - Doctor visits, consultations
  • 🏥 procedure - Surgeries, biopsies, procedures

Document Linking

Documents can be linked using relative paths from the timeline page:

  • Blood test JSONs: ../data/processed/individual_tests/YYYY-MM-DD_*.json
  • PDFs: ../data/raw/blood-tests/filename.pdf
  • Reports: Any relative path from docs/health-data/timeline.md

Custom Categories

To add a new category, edit the categories object in the JSON file:

"categories": {
  "custom": {
    "label": "Custom Category",
    "color": "#hexcolor",
    "icon": "๐Ÿ“Œ"
  }
}

Date Scaling

The timeline automatically scales the spacing between events based on the time difference:

  • Small gap (1-4 weeks): 2rem spacing
  • Medium gap (1-3 months): 4rem spacing
  • Large gap (3+ months): 6rem spacing
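
Expressed as a function, with gaps under a week assumed to fall in the small bucket (a sketch of the rule, not the portal's code):

```python
def event_spacing_rem(gap_days: int) -> int:
    """Vertical spacing between consecutive timeline events, in rem."""
    if gap_days < 28:    # up to ~4 weeks: small gap
        return 2
    if gap_days < 90:    # ~1-3 months: medium gap
        return 4
    return 6             # 3+ months: large gap
```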


Version: 0.1.0
Last Updated: 2025-11-16
Maintainer: Jan-Marten Daling