Technical Documentation

This section contains technical documentation for developers and maintainers of My Health Portal.

System Architecture

The platform is built as a containerized Python application with a documentation portal:

```
┌────────────────────────────────────────────────────┐
│                  Docker Container                  │
│  ┌──────────────────────────────────────────────┐  │
│  │         MkDocs Documentation Portal          │  │
│  │                 (Port 8001)                  │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │         Python 3.11 Processing Layer         │  │
│  │  • Parsers (PDF extraction)                  │  │
│  │  • Analyzers (statistical analysis)          │  │
│  │  • Exporters (CSV/JSON generation)           │  │
│  │  • Visualizers (charts & graphs)             │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │              Data Storage Layer              │  │
│  │  • data/raw/ (sensitive, git-ignored)        │  │
│  │  • data/processed/ (structured output)       │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
```

Core Components

Blood Test Parser

Location: src/parsers/blood_test_parser.py

Purpose: Extract structured data from PathCare laboratory PDF reports.

Key Features:

  • Multi-department section detection (Biochemistry, Haematology, Endocrinology, Serology, PCR)
  • Flexible regex-based test line parsing
  • Metadata extraction (dates, requisition numbers, doctors, labs)
  • Reference range parsing with min/max values
  • Flag detection (High/Low/Critical markers)
  • Delta value tracking
  • Exclusion list system for problematic files

Usage:

```bash
# Process first 5 PDFs (testing)
python src/parsers/blood_test_parser.py --limit 5

# Process all PDFs
python src/parsers/blood_test_parser.py --all
```

Output:

  • Individual JSON files per test session
  • Extraction summary with statistics
  • Detailed logging (console + file)

Data Consolidator

Location: src/exporters/consolidate_data.py

Purpose: Combine individual test JSONs into master datasets.

Outputs:

  • blood_tests.csv - Master dataset with all test results
  • metadata.csv - Test session metadata
  • biomarker_catalog.json - Frequency analysis of biomarkers

Usage:

```bash
python src/exporters/consolidate_data.py
```

Data Schemas

Individual Test JSON Schema

```json
{
  "metadata": {
    "source_file": "filename.pdf",
    "collection_date": "2025-11-14",
    "collection_time": "14:20",
    "received_date": "2025-11-14",
    "received_time": "15:45",
    "requisition_number": "654391870",
    "patient_name": "DALING JAN",
    "patient_id": "8305145088089",
    "ordering_doctor": "DR J C H LAPORTA",
    "laboratory": "Christiaan Barnard Memorial Hospital"
  },
  "tests": [
    {
      "test_name": "HAEMOGLOBIN",
      "value": 14.2,
      "unit": "g/dL",
      "reference_range": "13.0 - 18.0",
      "reference_min": 13.0,
      "reference_max": 18.0,
      "flag": null,
      "department": "Ha HAEMATOLOGY",
      "delta_value": null,
      "delta_date": null
    }
  ]
}
```
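
For a quick sanity check, a session file can be loaded directly. A minimal sketch that lists any flagged results; the filename here is hypothetical (actual names follow the `YYYY-MM-DD_*.json` pattern), and field names follow the schema above:

```python
import json
from pathlib import Path

# Hypothetical session file; real names follow YYYY-MM-DD_*.json
path = Path('data/processed/individual_tests/2025-11-14_654391870.json')
session = json.loads(path.read_text())

print(session['metadata']['collection_date'], session['metadata']['laboratory'])
for test in session['tests']:
    if test['flag']:  # only H/L/* results
        print(test['test_name'], test['value'], test['unit'], test['flag'])
```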

Master CSV Schema

| Column | Type | Description |
|--------|------|-------------|
| collection_date | date | When the sample was collected |
| collection_time | time | Collection time |
| test_name | string | Name of the test (normalized uppercase) |
| value | float | Numeric test result |
| unit | string | Unit of measurement (g/dL, IU/L, etc.) |
| reference_range | string | Normal range as text |
| reference_min | float | Lower bound of the normal range |
| reference_max | float | Upper bound of the normal range |
| flag | string | H (High), L (Low), * (Critical), or null |
| department | string | Lab department code + name |
| delta_value | float | Change from the previous test |
| delta_date | date | Date of the previous test |
| requisition_number | string | Lab requisition ID |
| ordering_doctor | string | Requesting physician |
| laboratory | string | Lab facility name |
| source_file | string | Original PDF filename |
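
For analysis it helps to parse the date columns on load. A minimal sketch using the column names above (keeping requisition numbers as strings is an optional precaution, not a requirement):

```python
import pandas as pd

# Parse date columns up front; keep requisition numbers as strings
# so any leading zeros survive the round trip.
df = pd.read_csv(
    'data/processed/blood_tests.csv',
    parse_dates=['collection_date', 'delta_date'],
    dtype={'requisition_number': 'string'},
)
print(df.dtypes)
```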

Biomarker Catalog Schema

```json
{
  "HAEMOGLOBIN": {
    "count": 57,
    "first_seen": "2025-03-21",
    "last_seen": "2025-11-14",
    "units": ["g/dL"],
    "departments": ["Ha HAEMATOLOGY"]
  }
}
```
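
The catalog entries can be rebuilt from the master CSV. One plausible derivation (the consolidator's own code may differ):

```python
import json
import pandas as pd

df = pd.read_csv('data/processed/blood_tests.csv')

# ISO date strings sort chronologically, so min/max give first/last seen.
catalog = {
    name: {
        'count': int(len(group)),
        'first_seen': group['collection_date'].min(),
        'last_seen': group['collection_date'].max(),
        'units': sorted(group['unit'].dropna().unique()),
        'departments': sorted(group['department'].dropna().unique()),
    }
    for name, group in df.groupby('test_name')
}
print(json.dumps(catalog['HAEMOGLOBIN'], indent=2))
```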

Parser Implementation Details

Section Detection

The parser identifies department sections using regex patterns:

```python
section_patterns = [
    r'(Ch\s+[A-Z]+)',      # Ch BIOCHEMISTRY
    r'(Ha\s+[A-Z]+)',      # Ha HAEMATOLOGY
    r'(Pc\s+[A-Z\.\s]+)',  # Pc P.C.R. DEPARTMENT
    r'(Se\s+[A-Z]+)',      # Se SEROLOGY
    r'(En\s+[A-Z]+)',      # En ENDOCRINOLOGY
]
```

Test Line Parsing

The main regex pattern for extracting test results:

```python
pattern = r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+([\d.<>]+\s*[-–]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?\s*([a-zA-Z0-9/\-%]+)?'
```

Captures:

  1. Test name (letters, spaces, hyphens, parentheses, slashes, percentages)
  2. Numeric value (digits, decimal points, comparison operators)
  3. Flag character (H, L, *, #)
  4. Reference range (various formats)
  5. Unit of measurement
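
Applied to typical result lines (invented here for illustration), the groups map out as follows:

```python
import re

pattern = r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+([\d.<>]+\s*[-–]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?\s*([a-zA-Z0-9/\-%]+)?'

# Sample lines in the PathCare layout (invented for illustration)
for line in ['HAEMOGLOBIN 14.2 13.0 - 18.0 g/dL',
             'POTASSIUM 5.6 H 3.5 - 5.1 mmol/L']:
    match = re.match(pattern, line)
    if match:
        name, value, flag, ref_range, unit = match.groups()
        print(name.strip(), value, flag, ref_range, unit)
# HAEMOGLOBIN 14.2 None 13.0 - 18.0 g/dL
# POTASSIUM 5.6 H 3.5 - 5.1 mmol/L
```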

Skip Patterns

Lines to ignore during parsing:

```python
skip_patterns = [
    r'^\s*$',                    # Empty lines
    r'^[A-Z][a-z]',              # Section headers (e.g., "Ha HAEMATOLOGY")
    r'^[A-Z\s]+$',               # ALL CAPS LINES
    r'RED CELLS|WHITE CELLS',    # Subsection headers
]
```
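
A line is skipped when any pattern matches. A minimal helper sketch; the parser's actual control flow may differ:

```python
import re

skip_patterns = [
    r'^\s*$',                    # Empty lines
    r'^[A-Z][a-z]',              # Section headers (e.g., "Ha HAEMATOLOGY")
    r'^[A-Z\s]+$',               # ALL CAPS LINES
    r'RED CELLS|WHITE CELLS',    # Subsection headers
]

def should_skip(line: str) -> bool:
    """True if the line matches any skip pattern."""
    return any(re.search(p, line) for p in skip_patterns)

print(should_skip('RED CELLS'))              # True (subsection header)
print(should_skip('HAEMOGLOBIN 14.2 g/dL'))  # False (a real result line)
```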

Quality Metrics

Current Performance (v0.1.0)

  • Success Rate: 100% (77/77 PDFs processed)
  • Test Extraction: 1,845 tests across 71 sessions
  • Biomarkers Tracked: 131 unique markers
  • Date Coverage: 8 months (March - November 2025)
  • Excluded Files: 8 (documented in .exclude)

Validation Approach

  1. Automated Extraction: Parser processes all PDFs
  2. Summary Statistics: Test counts, biomarker frequencies
  3. Spot Checking: Manual verification of critical biomarkers
  4. Trend Analysis: Query specific markers to verify data integrity
  5. Flag Validation: Confirm HIGH/LOW markers match reference ranges
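
Step 5, for example, can be automated with pandas. A sketch assuming the Master CSV schema above:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/processed/blood_tests.csv')

# Only rows with a fully parsed numeric range can be checked.
chk = df.dropna(subset=['value', 'reference_min', 'reference_max']).copy()

# Recompute the expected flag from the reference bounds.
chk['expected_flag'] = np.select(
    [chk['value'] > chk['reference_max'], chk['value'] < chk['reference_min']],
    ['H', 'L'],
    default='',
)

# Compare against the lab's H/L flags; '*' criticals will surface
# here for manual review rather than being silently accepted.
actual = chk['flag'].where(chk['flag'].isin(['H', 'L'])).fillna('')
mismatches = chk[actual != chk['expected_flag']]
print(f'{len(mismatches)} rows where flag and reference range disagree')
```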

Exclusion List

Files in data/raw/blood-tests/.exclude are skipped during parsing:

```text
# COVID test (different format)
2021-03-27.09_17.Daling, Jan-Marten-672099237-.pdf

# Corrupted PDF
2025-04-09.DALING_JANMARTEN-664471957(HA)-250409122955288.pdf

# Format variations requiring special handling
2025-04-24.2.Daling, Jan-Marten-664477954-.pdf
2025-07-11.12_39.Daling, Jan-Marten-664013903-.pdf
# ... (6 files with 0 tests extracted)
```
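
One plausible way to honour the exclusion list when collecting PDFs, assuming one filename per line with `#` comments (not necessarily the parser's implementation):

```python
from pathlib import Path

def load_exclusions(raw_dir: Path) -> set[str]:
    """Filenames listed in .exclude, ignoring blanks and # comments."""
    exclude_file = raw_dir / '.exclude'
    if not exclude_file.exists():
        return set()
    return {
        line.strip()
        for line in exclude_file.read_text().splitlines()
        if line.strip() and not line.lstrip().startswith('#')
    }

raw = Path('data/raw/blood-tests')
excluded = load_exclusions(raw)
pdfs = [p for p in sorted(raw.glob('*.pdf')) if p.name not in excluded]
```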

Development Workflow

Adding New PDFs

  1. Place PDFs in data/raw/blood-tests/
  2. Run parser: docker exec remission python src/parsers/blood_test_parser.py --all
  3. Consolidate: docker exec remission python src/exporters/consolidate_data.py
  4. Review extraction summary for any issues

Resetting Data

To clear processed data and re-import:

```bash
# Clean processed files
rm -f data/processed/*.csv
rm -f data/processed/*.json
rm -rf data/processed/individual_tests/*

# Re-run parser and consolidator
docker exec remission python src/parsers/blood_test_parser.py --all
docker exec remission python src/exporters/consolidate_data.py
```

Querying Data

Use pandas to query the master CSV:

```python
import pandas as pd

# Load data
df = pd.read_csv('data/processed/blood_tests.csv')

# Query specific biomarker
crp = df[df['test_name'] == 'C-REACTIVE PROTEIN']
print(crp[['collection_date', 'value', 'unit', 'flag']])

# Filter by date range
recent = df[df['collection_date'] >= '2025-10-01']

# Flag analysis
high_flags = df[df['flag'] == 'H']
```

Technology Stack

Core Dependencies

  • Python 3.11 - Runtime environment
  • pdfplumber - PDF text extraction
  • pandas - Data manipulation and analysis
  • numpy - Numerical operations
  • MkDocs - Documentation generation
  • Material for MkDocs - Documentation theme

Development Tools

  • Docker & Docker Compose - Containerization
  • Git - Version control
  • logging - Application logging
  • json - Data serialization
  • pathlib - File path handling

Security & Privacy

Data Protection

  • All data/ directories are git-ignored
  • Patient information never committed to version control
  • PDFs contain sensitive health information (PHI)
  • Container isolates processing environment

Access Control

  • Repository is private (JMDaling/remission)
  • Docker container runs as non-root user
  • MkDocs documentation does not expose raw data
  • Processed CSVs stored locally only

Future Enhancements

Planned Features

  1. Visualization Module
     • Time-series charts for biomarkers
     • Correlation heatmaps
     • Reference range overlay
     • Treatment timeline integration

  2. Statistical Analysis
     • Trend detection algorithms
     • Outlier identification
     • Predictive modeling
     • Correlation analysis

  3. Treatment Tracking
     • Medication schedule
     • Dosage tracking
     • Side effect logging
     • Effectiveness correlation

  4. Alert System
     • Concerning trend detection
     • Reference range violations
     • Test frequency reminders
     • Doctor appointment triggers

  5. Multi-Lab Support
     • Additional PDF formats
     • Format detection
     • Unified data model
     • Lab comparison tools

Medical Timeline

Location: docs/health-data/timeline.md
Configuration: docs/config/medical-events.json

The medical timeline provides a visual, chronological view of key medical events with scaled date spacing.

Managing Timeline Events

Timeline events are managed in the JSON configuration file: docs/config/medical-events.json

Adding a New Event

Edit the JSON file and add a new event object to the events array:

```json
{
  "date": "2025-12-01",
  "title": "Event Title",
  "description": "Event description",
  "category": "test",
  "documents": [
    {
      "name": "Document Name",
      "path": "../path/to/document.pdf"
    }
  ],
  "notes": "Optional notes or key metrics"
}
```
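
Events can also be appended programmatically. A minimal helper sketch; the chronological sort is an assumption about how the timeline page reads the file:

```python
import json
from pathlib import Path

config = Path('docs/config/medical-events.json')
data = json.loads(config.read_text())

# Hypothetical new event following the schema above
data['events'].append({
    'date': '2025-12-01',
    'title': 'Event Title',
    'description': 'Event description',
    'category': 'test',
})

# Keep events chronological (assumed, not verified against the page)
data['events'].sort(key=lambda e: e['date'])
config.write_text(json.dumps(data, indent=2, ensure_ascii=False))
```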

Event Properties

| Property | Type | Required | Description |
|----------|------|----------|-------------|
| date | string | Yes | ISO date format (YYYY-MM-DD) |
| title | string | Yes | Event title |
| description | string | Yes | Brief description |
| category | string | Yes | One of: diagnosis, treatment, test, appointment, procedure |
| documents | array | No | Array of document objects with name and path |
| notes | string | No | Additional notes or metrics |

Event Categories

  • 🔍 diagnosis - Initial diagnosis and major findings
  • 💊 treatment - Chemotherapy, medication, therapies
  • 🧪 test - Blood tests, imaging, lab results
  • 👨‍⚕️ appointment - Doctor visits, consultations
  • 🏥 procedure - Surgeries, biopsies, procedures

Document Linking

Documents can be linked using relative paths from the timeline page:

  • Blood test JSONs: ../data/processed/individual_tests/YYYY-MM-DD_*.json
  • PDFs: ../data/raw/blood-tests/filename.pdf
  • Reports: Any relative path from docs/health-data/timeline.md

Custom Categories

To add a new category, edit the categories object in the JSON file:

"categories": {
  "custom": {
    "label": "Custom Category",
    "color": "#hexcolor",
    "icon": "๐Ÿ“Œ"
  }
}

Date Scaling

The timeline automatically scales the spacing between events based on the time difference:

  • Small gap (1-4 weeks): 2rem spacing
  • Medium gap (1-3 months): 4rem spacing
  • Large gap (3+ months): 6rem spacing
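
Expressed as a function, with gaps under a week assumed to fall in the small bucket (a sketch of the rule, not the portal's code):

```python
def event_spacing_rem(gap_days: int) -> int:
    """Vertical spacing between consecutive timeline events, in rem."""
    if gap_days < 28:    # up to ~4 weeks: small gap
        return 2
    if gap_days < 90:    # ~1-3 months: medium gap
        return 4
    return 6             # 3+ months: large gap
```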


Version: 0.1.0
Last Updated: 2025-11-16
Maintainer: Jan-Marten Daling