Technical Documentation
This section contains technical documentation for developers and maintainers of My Health Portal.
Quick Links
- Page Metadata & Data Freshness - Automated tracking system for documentation updates
- Data Status Report - Current data freshness and page dependencies
System Architecture
The platform is built as a containerized Python application with a documentation portal:
┌─────────────────────────────────────────────────────┐
│                  Docker Container                   │
│  ┌───────────────────────────────────────────────┐  │
│  │          MkDocs Documentation Portal          │  │
│  │                  (Port 8001)                  │  │
│  └───────────────────────────────────────────────┘  │
│                                                     │
│  ┌───────────────────────────────────────────────┐  │
│  │         Python 3.11 Processing Layer          │  │
│  │  • Parsers (PDF extraction)                   │  │
│  │  • Analyzers (statistical analysis)           │  │
│  │  • Exporters (CSV/JSON generation)            │  │
│  │  • Visualizers (charts & graphs)              │  │
│  └───────────────────────────────────────────────┘  │
│                                                     │
│  ┌───────────────────────────────────────────────┐  │
│  │              Data Storage Layer               │  │
│  │  • data/raw/ (sensitive, git-ignored)         │  │
│  │  • data/processed/ (structured output)        │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
Core Components
Blood Test Parser
Location: src/parsers/blood_test_parser.py
Purpose: Extract structured data from PathCare laboratory PDF reports.
Key Features:
- Multi-department section detection (Biochemistry, Haematology, Endocrinology, Serology, PCR)
- Flexible regex-based test line parsing
- Metadata extraction (dates, requisition numbers, doctors, labs)
- Reference range parsing with min/max values
- Flag detection (High/Low/Critical markers)
- Delta value tracking
- Exclusion list system for problematic files
Usage:
# Process first 5 PDFs (testing)
python src/parsers/blood_test_parser.py --limit 5
# Process all PDFs
python src/parsers/blood_test_parser.py --all
Output:
- Individual JSON files per test session
- Extraction summary with statistics
- Detailed logging (console + file)
Data Consolidator
Location: src/exporters/consolidate_data.py
Purpose: Combine individual test JSONs into master datasets.
Outputs:
- blood_tests.csv - Master dataset with all test results
- metadata.csv - Test session metadata
- biomarker_catalog.json - Frequency analysis of biomarkers
Usage:
python src/exporters/consolidate_data.py
Data Schemas
Individual Test JSON Schema
{
  "metadata": {
    "source_file": "filename.pdf",
    "collection_date": "2025-11-14",
    "collection_time": "14:20",
    "received_date": "2025-11-14",
    "received_time": "15:45",
    "requisition_number": "654391870",
    "patient_name": "DALING JAN",
    "patient_id": "8305145088089",
    "ordering_doctor": "DR J C H LAPORTA",
    "laboratory": "Christiaan Barnard Memorial Hospital"
  },
  "tests": [
    {
      "test_name": "HAEMOGLOBIN",
      "value": 14.2,
      "unit": "g/dL",
      "reference_range": "13.0 - 18.0",
      "reference_min": 13.0,
      "reference_max": 18.0,
      "flag": null,
      "department": "Ha HAEMATOLOGY",
      "delta_value": null,
      "delta_date": null
    }
  ]
}
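The schema stores the range both as text (`reference_range`) and as numbers (`reference_min`, `reference_max`). The sketch below shows one way that conversion could work. `parse_reference_range` is a hypothetical helper, not code taken from `blood_test_parser.py`, and the handling of one-sided ranges such as `< 5.0` is an assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration; the real parser may differ.
_RANGE = re.compile(r'^\s*([\d.]+)\s*[-\u2013]\s*([\d.]+)\s*$')
_BOUND = re.compile(r'^\s*([<>])=?\s*([\d.]+)\s*$')


def parse_reference_range(text: Optional[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (reference_min, reference_max); either bound may be None."""
    if not text:
        return None, None
    try:
        m = _RANGE.match(text)
        if m:
            return float(m.group(1)), float(m.group(2))
        m = _BOUND.match(text)
        if m:
            value = float(m.group(2))
            # "< 5.0" only has an upper bound; "> 60" only has a lower bound.
            return (None, value) if m.group(1) == '<' else (value, None)
    except ValueError:
        # Digits and dots that do not form a number, e.g. "1.2.3".
        pass
    return None, None


print(parse_reference_range("13.0 - 18.0"))
```

Returning `None` for an unparseable range keeps the text version available in `reference_range` without inventing numeric bounds.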
Master CSV Schema
| Column | Type | Description |
|---|---|---|
| collection_date | date | When sample was collected |
| collection_time | time | Collection time |
| test_name | string | Name of the test (normalized uppercase) |
| value | float | Numeric test result |
| unit | string | Unit of measurement (g/dL, IU/L, etc.) |
| reference_range | string | Normal range as text |
| reference_min | float | Lower bound of normal range |
| reference_max | float | Upper bound of normal range |
| flag | string | H (High), L (Low), * (Critical), or null |
| department | string | Lab department code + name |
| delta_value | float | Change from previous test |
| delta_date | date | Date of previous test |
| requisition_number | string | Lab requisition ID |
| ordering_doctor | string | Requesting physician |
| laboratory | string | Lab facility name |
| source_file | string | Original PDF filename |
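Each master CSV row combines one entry from `tests` with a subset of the session `metadata`. The sketch below shows how that flattening could look. `flatten_session` and `METADATA_COLUMNS` are hypothetical names, not taken from `consolidate_data.py`. Note that the schema above lists no `patient_name`, `patient_id`, or received date/time columns, so the sketch leaves them out.

```python
from typing import Any, Dict, List

# Metadata fields that the master CSV schema repeats on every row.
METADATA_COLUMNS = [
    "collection_date", "collection_time", "requisition_number",
    "ordering_doctor", "laboratory", "source_file",
]


def flatten_session(session: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one individual-test JSON document into one dict per test row."""
    meta = session.get("metadata", {})
    shared = {col: meta.get(col) for col in METADATA_COLUMNS}
    return [{**shared, **test} for test in session.get("tests", [])]


example = {
    "metadata": {"source_file": "filename.pdf", "collection_date": "2025-11-14",
                 "patient_name": "EXAMPLE PATIENT"},
    "tests": [{"test_name": "HAEMOGLOBIN", "value": 14.2, "unit": "g/dL"}],
}
rows = flatten_session(example)
print(rows[0]["test_name"], rows[0]["collection_date"])
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` before writing the CSV.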
Biomarker Catalog Schema
{
  "HAEMOGLOBIN": {
    "count": 57,
    "first_seen": "2025-03-21",
    "last_seen": "2025-11-14",
    "units": ["g/dL"],
    "departments": ["Ha HAEMATOLOGY"]
  }
}
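A catalog of this shape can be derived from the flattened test rows. The sketch below is one possible approach, assuming each row carries `test_name`, `collection_date`, `unit`, and `department`. `build_catalog` is a hypothetical name, not the consolidator's actual function.

```python
from typing import Any, Dict, Iterable


def build_catalog(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Aggregate per-biomarker counts, date span, units, and departments."""
    catalog: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        day = row["collection_date"]
        entry = catalog.setdefault(row["test_name"], {
            "count": 0, "first_seen": day, "last_seen": day,
            "units": [], "departments": [],
        })
        entry["count"] += 1
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        entry["first_seen"] = min(entry["first_seen"], day)
        entry["last_seen"] = max(entry["last_seen"], day)
        for key, field in (("units", "unit"), ("departments", "department")):
            if row.get(field) and row[field] not in entry[key]:
                entry[key].append(row[field])
    return catalog


sample_rows = [
    {"test_name": "HAEMOGLOBIN", "collection_date": "2025-11-14",
     "unit": "g/dL", "department": "Ha HAEMATOLOGY"},
    {"test_name": "HAEMOGLOBIN", "collection_date": "2025-03-21",
     "unit": "g/dL", "department": "Ha HAEMATOLOGY"},
]
catalog = build_catalog(sample_rows)
print(catalog["HAEMOGLOBIN"]["count"])
```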
Parser Implementation Details
Section Detection
The parser identifies department sections using regex patterns:
section_patterns = [
    r'(Ch\s+[A-Z]+)',        # Ch BIOCHEMISTRY
    r'(Ha\s+[A-Z]+)',        # Ha HAEMATOLOGY
    r'(Pc\s+[A-Z\.\s]+)',    # Pc P.C.R. DEPARTMENT
    r'(Se\s+[A-Z]+)',        # Se SEROLOGY
    r'(En\s+[A-Z]+)',        # En ENDOCRINOLOGY
]
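As a rough illustration of how these patterns might be applied, the sketch below tries each one against the start of a line and returns the first match. `detect_section` is a hypothetical helper, and anchoring with `re.match` is an assumption; the actual parser may search anywhere in the line.

```python
import re
from typing import Optional

section_patterns = [
    r'(Ch\s+[A-Z]+)',
    r'(Ha\s+[A-Z]+)',
    r'(Pc\s+[A-Z\.\s]+)',
    r'(Se\s+[A-Z]+)',
    r'(En\s+[A-Z]+)',
]


def detect_section(line: str) -> Optional[str]:
    """Return the department header found at the start of the line, if any."""
    stripped = line.strip()
    for pattern in section_patterns:
        m = re.match(pattern, stripped)
        if m:
            return m.group(1).strip()
    return None


print(detect_section("Ha HAEMATOLOGY"))
```

Because the prefixes are case-sensitive, an all-caps test name such as `HAEMOGLOBIN` is not mistaken for the `Ha` department header.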
Test Line Parsing
The main regex pattern for extracting test results:
pattern = r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+([\d.<>]+\s*[-–]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?\s*([a-zA-Z0-9/\-%]+)?'
Captures:
1. Test name (letters, spaces, hyphens, parentheses, slashes, percentages)
2. Numeric value (digits, decimal points, comparison operators)
3. Flag character (H, L, *, #)
4. Reference range (various formats)
5. Unit of measurement
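To make the five capture groups concrete, the sketch below applies the pattern to a few invented sample lines. `parse_test_line` is a hypothetical wrapper, and the second character in the range separator class is assumed to be an en dash (`\u2013`).

```python
import re
from typing import Dict, Optional

# Same pattern as above, split across lines for readability.
# Assumption: the range separator class is a hyphen or an en dash (\u2013).
TEST_LINE = re.compile(
    r'^([A-Za-z][A-Za-z\s\-()/%]+?)\s+([0-9.<>]+)\s*([HL*#])?\s+'
    r'([\d.<>]+\s*[-\u2013]\s*[\d.<>]+|>=?\s*[\d.]+|<=?\s*[\d.]+)?'
    r'\s*([a-zA-Z0-9/\-%]+)?'
)


def parse_test_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Map the five capture groups to named fields, or return None."""
    m = TEST_LINE.match(line)
    if m is None:
        return None
    name, value, flag, ref_range, unit = m.groups()
    return {"test_name": name.strip(), "value": value, "flag": flag,
            "reference_range": ref_range, "unit": unit}


# Invented sample lines, not copied from a real report.
print(parse_test_line("HAEMOGLOBIN 14.2 13.0 - 18.0 g/dL"))
print(parse_test_line("C-REACTIVE PROTEIN 12.5 H 0.0 - 5.0 mg/L"))
```

As written, the pattern requires whitespace after the value (and optional flag), so a bare `HAEMOGLOBIN 14.2` line with nothing after the number does not match.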
Skip Patterns
Lines to ignore during parsing:
skip_patterns = [
    r'^\s*$',                   # Empty lines
    r'^[A-Z][a-z]',             # Section headers (e.g., "Ha HAEMATOLOGY")
    r'^[A-Z\s]+$',              # All-caps lines
    r'RED CELLS|WHITE CELLS',   # Subsection headers
]
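A minimal sketch of how such a skip list could be checked before a line reaches the test-line regex. `should_skip` is a hypothetical name, and the use of `re.search` is an assumption, chosen because the last pattern is unanchored.

```python
import re

skip_patterns = [
    r'^\s*$',
    r'^[A-Z][a-z]',
    r'^[A-Z\s]+$',
    r'RED CELLS|WHITE CELLS',
]

# Compile once so the check is cheap when run on every line.
_SKIP = [re.compile(p) for p in skip_patterns]


def should_skip(line: str) -> bool:
    """Return True if any skip pattern matches the line."""
    return any(p.search(line) for p in _SKIP)


print(should_skip("Ha HAEMATOLOGY"))
print(should_skip("HAEMOGLOBIN 14.2 13.0 - 18.0 g/dL"))
```

One consequence worth knowing: the second pattern matches any line that starts with a capital followed by a lowercase letter, so a result line with a mixed-case test name (for example `Haemoglobin 14.2 ...`) would be skipped as well.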
Quality Metrics
Current Performance (v0.1.0)
- Success Rate: 100% (77/77 PDFs processed)
- Test Extraction: 1,845 tests across 71 sessions
- Biomarkers Tracked: 131 unique markers
- Date Coverage: 8 months (March - November 2025)
- Excluded Files: 8 (documented in .exclude)
Validation Approach
- Automated Extraction: Parser processes all PDFs
- Summary Statistics: Test counts, biomarker frequencies
- Spot Checking: Manual verification of critical biomarkers
- Trend Analysis: Query specific markers to verify data integrity
- Flag Validation: Confirm HIGH/LOW markers match reference ranges
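The flag validation step can be partly automated. The sketch below derives the flag a value should carry from its numeric bounds and compares it with the recorded one. Both function names are hypothetical, and the treatment of `*` (critical) as "must at least be out of range" is an assumption, since the critical thresholds are not described here.

```python
from typing import Any, Dict, Optional


def expected_flag(value: float, ref_min: Optional[float],
                  ref_max: Optional[float]) -> Optional[str]:
    """Return 'H', 'L', or None based on the numeric reference bounds."""
    if ref_max is not None and value > ref_max:
        return "H"
    if ref_min is not None and value < ref_min:
        return "L"
    return None


def flag_is_consistent(row: Dict[str, Any]) -> bool:
    """Check a row's recorded flag against its reference range."""
    expected = expected_flag(row["value"], row.get("reference_min"),
                             row.get("reference_max"))
    recorded = row.get("flag")
    if recorded == "*":
        # Assumption: a critical marker should at least be out of range.
        return expected is not None
    return recorded == expected


row = {"value": 14.2, "reference_min": 13.0, "reference_max": 18.0, "flag": None}
print(flag_is_consistent(row))
```

Rows that fail the check are candidates for the manual spot check rather than proof of a parsing error, because labs may flag on criteria other than the printed range.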
Exclusion List
Files in data/raw/blood-tests/.exclude are skipped during parsing:
# COVID test (different format)
2021-03-27.09_17.Daling, Jan-Marten-672099237-.pdf
# Corrupted PDF
2025-04-09.DALING_JANMARTEN-664471957(HA)-250409122955288.pdf
# Format variations requiring special handling
2025-04-24.2.Daling, Jan-Marten-664477954-.pdf
2025-07-11.12_39.Daling, Jan-Marten-664013903-.pdf
# ... (6 files with 0 tests extracted)
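Reading a file in this format is straightforward. The sketch below assumes one filename per line, with blank lines and lines starting with `#` ignored, as the excerpt suggests. `parse_exclusions` and `load_exclusions` are hypothetical names, not the parser's actual functions.

```python
from pathlib import Path
from typing import Set


def parse_exclusions(text: str) -> Set[str]:
    """Collect filenames, ignoring blank lines and '#' comment lines."""
    names = set()
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def load_exclusions(exclude_file: Path) -> Set[str]:
    """Read the exclusion file; a missing file means nothing is excluded."""
    if not exclude_file.exists():
        return set()
    return parse_exclusions(exclude_file.read_text(encoding="utf-8"))


sample = "# Corrupted PDF\nbroken-report.pdf\n\n# Other format\nother-report.pdf\n"
print(sorted(parse_exclusions(sample)))
```

A caller could then skip any PDF whose `Path.name` is in the set returned by `load_exclusions(Path("data/raw/blood-tests/.exclude"))`.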
Development Workflow
Adding New PDFs
- Place PDFs in data/raw/blood-tests/
- Run parser: docker exec remission python src/parsers/blood_test_parser.py --all
- Consolidate: docker exec remission python src/exporters/consolidate_data.py
- Review extraction summary for any issues
Resetting Data
To clear processed data and re-import:
```bash
# Clean processed files
rm -f data/processed/*.csv
rm -f data/processed/*.json
rm -rf data/processed/individual_tests/*

# Re-run parser and consolidator
docker exec remission python src/parsers/blood_test_parser.py --all
docker exec remission python src/exporters/consolidate_data.py
```
Querying Data
Use pandas to query the master CSV:
```python
import pandas as pd

# Load data
df = pd.read_csv('data/processed/blood_tests.csv')

# Query specific biomarker
crp = df[df['test_name'] == 'C-REACTIVE PROTEIN']
print(crp[['collection_date', 'value', 'unit', 'flag']])

# Filter by date range (ISO date strings compare correctly as text)
recent = df[df['collection_date'] >= '2025-10-01']

# Flag analysis
high_flags = df[df['flag'] == 'H']
```
Technology Stack
Core Dependencies
- Python 3.11 - Runtime environment
- pdfplumber - PDF text extraction
- pandas - Data manipulation and analysis
- numpy - Numerical operations
- MkDocs - Documentation generation
- Material for MkDocs - Documentation theme
Development Tools
- Docker & Docker Compose - Containerization
- Git - Version control
- logging - Application logging
- json - Data serialization
- pathlib - File path handling
Security & Privacy
Data Protection
- All data/ directories are git-ignored
- Patient information never committed to version control
- PDFs contain sensitive health information (PHI)
- Container isolates processing environment
Access Control
- Repository is private (JMDaling/remission)
- Docker container runs as non-root user
- MkDocs documentation does not expose raw data
- Processed CSVs stored locally only
Future Enhancements
Planned Features
- Visualization Module
  - Time-series charts for biomarkers
  - Correlation heatmaps
  - Reference range overlay
  - Treatment timeline integration
- Statistical Analysis
  - Trend detection algorithms
  - Outlier identification
  - Predictive modeling
  - Correlation analysis
- Treatment Tracking
  - Medication schedule
  - Dosage tracking
  - Side effect logging
  - Effectiveness correlation
- Alert System
  - Concerning trend detection
  - Reference range violations
  - Test frequency reminders
  - Doctor appointment triggers
- Multi-Lab Support
  - Additional PDF formats
  - Format detection
  - Unified data model
  - Lab comparison tools
Medical Timeline
Location: docs/health-data/timeline.md
Configuration: docs/config/medical-events.json
The medical timeline provides a visual, chronological view of key medical events with scaled date spacing.
Managing Timeline Events
Timeline events are managed in the JSON configuration file: docs/config/medical-events.json
Adding a New Event
Edit the JSON file and add a new event object to the events array:
{
  "date": "2025-12-01",
  "title": "Event Title",
  "description": "Event description",
  "category": "test",
  "documents": [
    {
      "name": "Document Name",
      "path": "../path/to/document.pdf"
    }
  ],
  "notes": "Optional notes or key metrics"
}
Event Properties
| Property | Type | Required | Description |
|---|---|---|---|
| date | string | Yes | ISO date format (YYYY-MM-DD) |
| title | string | Yes | Event title |
| description | string | Yes | Brief description |
| category | string | Yes | One of: diagnosis, treatment, test, appointment, procedure |
| documents | array | No | Array of document objects with name and path |
| notes | string | No | Additional notes or metrics |
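Since events are edited by hand, a small check before committing the JSON can catch typos. The sketch below validates one event against the property table. `validate_event` is a hypothetical helper, not part of the project, and the category set would need extending if custom categories are defined.

```python
from datetime import datetime
from typing import Any, Dict, List

REQUIRED = ("date", "title", "description", "category")
DEFAULT_CATEGORIES = frozenset(
    {"diagnosis", "treatment", "test", "appointment", "procedure"}
)


def validate_event(event: Dict[str, Any],
                   categories=DEFAULT_CATEGORIES) -> List[str]:
    """Return a list of problems; an empty list means the event looks valid."""
    problems = [f"missing required property: {key}"
                for key in REQUIRED if not event.get(key)]
    if event.get("date"):
        try:
            datetime.strptime(event["date"], "%Y-%m-%d")
        except ValueError:
            problems.append(f"date is not YYYY-MM-DD: {event['date']}")
    if event.get("category") and event["category"] not in categories:
        problems.append(f"unknown category: {event['category']}")
    for doc in event.get("documents", []):
        if not {"name", "path"} <= set(doc):
            problems.append("document entries need both name and path")
    return problems


event = {"date": "2025-12-01", "title": "Event Title",
         "description": "Event description", "category": "test"}
print(validate_event(event))
```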
Event Categories
- diagnosis - Initial diagnosis and major findings
- treatment - Chemotherapy, medication, therapies
- test - Blood tests, imaging, lab results
- appointment - Doctor visits, consultations
- procedure - Surgeries, biopsies, procedures
Document Linking
Documents can be linked using relative paths from the timeline page:
- Blood test JSONs: ../data/processed/individual_tests/YYYY-MM-DD_*.json
- PDFs: ../data/raw/blood-tests/filename.pdf
- Reports: Any relative path from docs/health-data/timeline.md
Custom Categories
To add a new category, edit the categories object in the JSON file:
"categories": {
  "custom": {
    "label": "Custom Category",
    "color": "#hexcolor",
    "icon": "<emoji>"
  }
}
Date Scaling
The timeline automatically scales spacing between events based on time differences:
- Small gap (1-4 weeks): 2rem spacing
- Medium gap (1-3 months): 4rem spacing
- Large gap (3+ months): 6rem spacing
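The scaling rule can be expressed as a simple threshold function. The sketch below is written in Python for illustration only; the timeline page itself may implement this elsewhere. The exact day boundaries (30 and 90) and the handling of gaps shorter than one week are assumptions, since the ranges above do not pin them down.

```python
from datetime import date


def spacing_for_gap(previous: date, current: date) -> str:
    """Map the gap between two consecutive events to a CSS spacing value."""
    days = abs((current - previous).days)
    if days < 30:
        # Assumption: anything under about a month counts as a small gap,
        # including gaps shorter than one week.
        return "2rem"
    if days < 90:
        return "4rem"
    return "6rem"


print(spacing_for_gap(date(2025, 3, 21), date(2025, 4, 4)))
```

Bucketing the gap rather than scaling it linearly keeps long quiet periods from stretching the page while still showing that time has passed.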
Version: 0.1.0
Last Updated: 2025-11-16
Maintainer: Jan-Marten Daling