# San Francisco Restaurant Inspection System - Complete Knowledge Base
## Last Updated: August 16, 2025

---

## 🎯 System Overview

The San Francisco Restaurant Inspection System is a fully automated data collection and article generation pipeline that:
1. **Collects** inspection data from the SF MyHealthDepartment portal (using a Windows laptop with a residential IP)
2. **Uploads** data to the Linux server via an API authenticated with an API key
3. **Processes** PDFs and extracts structured data
4. **Generates** educational news articles using AI
5. **Publishes** content to CleanKitchens.org

### Key Achievements
- **Bypassed 403 Blocking**: The portal blocks datacenter IPs, so collection runs on a Windows laptop with a residential IP
- **Fully Automated**: Runs 6 times daily via Windows Task Scheduler
- **36 Inspections Collected**: SF restaurant inspections are being downloaded and processed successfully
- **Smart Incremental Collection**: Each run resumes from the last processed inspection instead of restarting from the beginning

---

## 📁 Directory Structure

```
/var/www/twin-digital-media/public_html/_sites/cleankitchens/data/sf/
├── SF_Smart_Collector_Fresh.bat    # Active Windows collector (v3 - PRODUCTION)
├── api_receiver_v2.php             # Server API endpoint
├── api_log.txt                     # API activity log
├── pdfs/                           # 36 inspection PDFs
├── json/                           # 36 inspection JSON files
├── generated_articles/             # AI-generated news articles
├── scripts/
│   └── sf-article-processor.py    # Article processor
├── sf_two_phase_generator.py      # Two-phase story generator
├── sf_weaviate_processor.py       # Weaviate vector DB integration
├── process_sf_pdfs.py             # PDF text extraction
└── SF_KNOWLEDGE_BASE.md           # This documentation
```

---

## 🔄 Data Collection Pipeline

### Phase 1: Windows Collection (SF_Smart_Collector_Fresh.bat)

**Schedule:** 6 times daily (12AM, 4AM, 8AM, 12PM, 4PM, 8PM)
**Location:** Windows laptop with residential IP
**Process:**
1. Opens Chrome browser (visible, not headless)
2. Navigates to https://inspections.myhealthdepartment.com/san-francisco
3. Clicks "Show More" 5 times on first run to load historical data
4. Locates the last processed inspection and resumes from there (smart incremental collection)
5. Downloads up to 10 inspections per run (rate limited)
6. Uploads to server via API with Base64-encoded PDFs
7. Tracks progress in `%LOCALAPPDATA%\SFCollector\tracker.json`
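
The incremental step can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the tracker layout (a single `last_inspection_id` key), the helper names, and the assumption that the portal lists inspections newest-first are all hypothetical.

```python
import json
from pathlib import Path

MAX_PER_RUN = 10  # per-run cap described in this document


def load_last_id(tracker_path):
    """Return the last processed inspection ID, or None on a first run."""
    path = Path(tracker_path)
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("last_inspection_id")  # hypothetical key name


def select_new_inspections(listed_ids, last_id, limit=MAX_PER_RUN):
    """Pick unprocessed IDs from a newest-first listing, oldest first."""
    if last_id in listed_ids:
        # Everything listed before the last processed ID is newer.
        new_ids = listed_ids[: listed_ids.index(last_id)]
    else:
        # First run, or the last ID is not on the loaded page.
        new_ids = list(listed_ids)
    # Oldest first, so the tracker only ever moves forward.
    return list(reversed(new_ids))[:limit]


def save_last_id(tracker_path, inspection_id):
    """Persist the most recently processed inspection ID."""
    path = Path(tracker_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"last_inspection_id": inspection_id}), encoding="utf-8"
    )
```

Processing oldest-first means that if a run is interrupted, the next run picks up where it stopped rather than skipping inspections.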

**Key Features:**
- Random wait times (2-5 seconds) to appear human
- Stale element handling for browser navigation
- UTF-8 encoding for Windows compatibility
- Error logging and retry queue
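
As a rough sketch of the random waits and the retry queue listed above: the helper names `human_pause` and `run_with_retry_queue` are hypothetical, and the real collector may structure this differently.

```python
import random
import time


def human_pause(min_s=2.0, max_s=5.0, sleep=time.sleep):
    """Sleep for a random interval in [min_s, max_s]; return the delay used."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def run_with_retry_queue(items, action, pause=human_pause):
    """Apply `action` to each item, collecting failures instead of aborting."""
    retry_queue = []
    for item in items:
        try:
            action(item)
        except Exception as exc:  # broad on purpose: log and keep going
            print(f"[FAILED] {item}: {exc}")
            retry_queue.append(item)
        pause()  # wait between actions, as the collector does
    return retry_queue
```

Items returned in the queue can be retried on the next scheduled run.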

### Phase 2: Server Reception (api_receiver_v2.php)

**Endpoint:** https://cleankitchens.org/data/sf/api_receiver_v2.php
**Authentication:** X-API-Key header
**Process:**
1. Validates API key
2. Receives Base64-encoded PDF
3. Saves PDF to `/data/sf/pdfs/`
4. Saves JSON metadata to `/data/sf/json/`
5. Logs activity to `api_log.txt`
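
From the client side, an upload to this endpoint might look like the sketch below. The JSON field names (`metadata`, `pdf_base64`) are assumptions and must match whatever `api_receiver_v2.php` actually expects. The collector lists Requests as a dependency; the standard library is used here only to keep the example self-contained.

```python
import base64
import json
import urllib.request

SERVER_URL = "https://cleankitchens.org/data/sf/api_receiver_v2.php"


def build_upload_request(pdf_bytes, metadata, api_key, url=SERVER_URL):
    """Build (but do not send) a POST request carrying one inspection."""
    payload = {
        # Field names are illustrative; align them with the PHP receiver.
        "metadata": metadata,
        "pdf_base64": base64.b64encode(pdf_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request is then a call to `urllib.request.urlopen(request, timeout=30)`, wrapped in whatever error handling the collector uses.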

---

## 📊 Data Processing Scripts

### 1. PDF Text Extraction (process_sf_pdfs.py)
- Extracts text from downloaded PDFs
- Uses PyPDF2 for text extraction
- Handles special characters and formatting
- Outputs structured text for analysis
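
A minimal sketch of this extraction step, assuming PyPDF2 2.0 or later (which provides `PdfReader`). The function names and the specific cleanup rules are illustrative, not taken from `process_sf_pdfs.py`.

```python
import re
import unicodedata


def extract_pdf_text(pdf_path):
    """Concatenate the text of every page in a PDF using PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def clean_text(raw):
    """Normalize special characters and collapse stray whitespace."""
    # NFKC folds ligatures and non-breaking spaces into plain forms.
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"[ \t]+", " ", text)  # runs of spaces and tabs
    text = "\n".join(line.strip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)  # excessive blank lines
    return text.strip()
```

Typical use would be `clean_text(extract_pdf_text(path))` for each file in `pdfs/`.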

### 2. Weaviate Integration (sf_weaviate_processor.py)
- Vectorizes inspection data for semantic search
- Stores in Weaviate vector database
- Enables similarity searches
- Links inspections to articles

### 3. Article Generation (sf_two_phase_generator.py)
**Two-Phase Approach:**
- **Phase 1:** Extract key facts from inspection PDF
  - Restaurant name, address, date
  - Violations found
  - Score/grade
  - Critical issues

- **Phase 2:** Generate educational article
  - Uses Claude 3.5 Sonnet via the Anthropic API
  - Incorporates FDA/CDC guidelines
  - Adds local context (neighborhoods, transit)
  - Creates SEO-optimized content
  - Generates structured data

**Output Format:**
```json
{
  "title": "Restaurant Name: Date Health Inspection Results",
  "content": "Educational article content...",
  "meta_description": "SEO description",
  "violations": ["violation1", "violation2"],
  "score": 85,
  "structured_data": {...}
}
```
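
Before publishing, a generated article could be checked against this format. The sketch below is one possible validator: the expected types are inferred from the example above, and `validate_article` is a hypothetical helper, not part of the existing scripts.

```python
# Expected type for each field, inferred from the example output format.
REQUIRED_FIELDS = {
    "title": str,
    "content": str,
    "meta_description": str,
    "violations": list,
    "score": int,
    "structured_data": dict,
}


def validate_article(article):
    """Return a list of problems; an empty list means the article is valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in article:
            problems.append(f"missing field: {field}")
        elif not isinstance(article[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a generated article in a single pass.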

---

## 🗓️ Scheduling & Automation

### Current Schedule (Windows Task Scheduler)
- **SF_Smart_Collector_12AM** - Midnight collection
- **SF_Smart_Collector_4AM** - Early morning
- **SF_Smart_Collector_8AM** - Morning
- **SF_Smart_Collector_12PM** - Noon
- **SF_Smart_Collector_4PM** - Afternoon
- **SF_Smart_Collector_8PM** - Evening

### Manual Execution
- Desktop shortcut: `SF_Smart_Manual.bat`
- Direct script: `%LOCALAPPDATA%\SFCollector\sf_smart_collector.py`

---

## 🔐 Security & Configuration

### API Configuration
```php
// api_receiver_v2.php
define('API_KEY', 'sk-sf-inspections-2025');
```

```python
# SF_Smart_Collector_Fresh.bat (embedded Python)
API_KEY = 'sk-sf-inspections-2025'
SERVER_URL = 'https://cleankitchens.org/data/sf/api_receiver_v2.php'
```
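
Because the key is hardcoded in both files, one option worth considering is reading it from the environment on the collector side. This is only a sketch: the variable name `SF_INSPECTIONS_API_KEY` is hypothetical and is not used anywhere in the current system.

```python
import os


def load_api_key(env_var="SF_INSPECTIONS_API_KEY", environ=os.environ):
    """Read the API key from the environment; fail loudly if it is unset."""
    # env_var is a hypothetical name chosen for this example.
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key
```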

### Rate Limiting
- Max 10 inspections per run
- 2-5 second random delays between actions
- 6 runs daily × 10 inspections per run = capacity of up to 60 inspections/day
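
The capacity arithmetic can be written out as a small calculation. The figures come from this document; the helper names are illustrative, and the result is a best-case lower bound, since actual throughput also depends on how many new inspections the portal publishes.

```python
import math

RUNS_PER_DAY = 6   # scheduled runs per day
MAX_PER_RUN = 10   # per-run download cap


def daily_capacity(runs=RUNS_PER_DAY, per_run=MAX_PER_RUN):
    """Maximum number of inspections collectable per day."""
    return runs * per_run


def days_to_collect(target, already_collected=0):
    """Minimum whole days needed to reach `target` at full capacity."""
    remaining = max(target - already_collected, 0)
    return math.ceil(remaining / daily_capacity())
```

For example, going from the 36 inspections collected so far to 1,000 would take at least `days_to_collect(1000, 36)` days, which works out to 17.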

---

## 📈 Data Statistics

### Collection Status (as of Aug 16, 2025)
- **Total PDFs:** 36
- **Total JSON:** 36
- **Date Range:** Recent inspections
- **Storage Used:** ~20MB
- **Articles Generated:** 5 test articles

### Processing Metrics
- **PDF Processing Time:** ~2 seconds per file
- **Article Generation:** ~15 seconds per article
- **API Upload:** ~5 seconds per inspection
- **Total Pipeline:** ~22 seconds per inspection (sum of the three steps above)

---

## 🚧 TODO - Pending Implementation

### 1. **Article Generation Scheduling** 🔴 HIGH PRIORITY
```bash
# Need to add to crontab
# Run daily at 2 AM to process previous day's inspections
0 2 * * * /home/chris/cleankitchens-env/bin/python /var/www/twin-digital-media/public_html/_sites/cleankitchens/data/sf/sf_two_phase_generator.py

# Considerations before scheduling:
# - Ensure site template is ready for SF articles
# - Test article quality with manual runs
# - Set up monitoring for generation failures
# - Configure email notifications
```

### 2. **Site Adjustments Needed**
- [ ] Create SF-specific article template
- [ ] Add SF neighborhood data
- [ ] Update navigation for SF section
- [ ] Configure URL routing for SF articles
- [ ] Test structured data validation
- [ ] Set up SF-specific images

### 3. **Future Enhancements**
- [ ] Historical data backfill (collect older inspections)
- [ ] Duplicate detection improvements
- [ ] Article update mechanism (for re-inspections)
- [ ] Statistics dashboard
- [ ] Email digest of new violations

---

## 🐛 Known Issues & Solutions

### Issue: Site blocks datacenter IPs (403 Forbidden)
**Solution:** Use Windows laptop with residential IP

### Issue: Unicode encoding errors on Windows
**Solution:** Replace Unicode symbols with ASCII ([SUCCESS], [FAILED])
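
A sketch of that replacement is shown below. Which symbols the collector originally printed is an assumption (a check mark and a cross mark are used here), and `ascii_safe` is a hypothetical helper name.

```python
# Assumed symbol-to-marker mapping: check mark and cross mark.
STATUS_MARKERS = {"\u2705": "[SUCCESS]", "\u274c": "[FAILED]"}


def ascii_safe(message):
    """Swap status symbols for ASCII markers and drop other non-ASCII."""
    for symbol, marker in STATUS_MARKERS.items():
        message = message.replace(symbol, marker)
    # Any remaining non-ASCII character becomes "?" instead of raising
    # UnicodeEncodeError on a legacy Windows console code page.
    return message.encode("ascii", errors="replace").decode("ascii")
```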

### Issue: Stale element errors in Selenium
**Solution:** Re-navigate to main page after each inspection

### Issue: PDFs saving to wrong directory
**Solution:** Updated api_receiver_v2.php to use `/data/sf/pdfs/`

---

## 📝 Maintenance Notes

### Daily Checks
1. Verify Task Scheduler is running (Windows)
2. Check api_log.txt for successful uploads
3. Monitor PDF/JSON directories for new files
4. Review error logs, if any
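
Part of the directory check could be scripted. The sketch below assumes each inspection's PDF and JSON file share the same base name, which this document does not confirm; `unmatched_files` is a hypothetical helper.

```python
from pathlib import Path


def unmatched_files(pdf_dir, json_dir):
    """Return (PDFs without JSON, JSON files without a PDF), by file stem."""
    # Assumes matching base names, e.g. abc.pdf and abc.json.
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    json_stems = {p.stem for p in Path(json_dir).glob("*.json")}
    return sorted(pdf_stems - json_stems), sorted(json_stems - pdf_stems)
```

Two empty lists mean every PDF has a matching JSON file, consistent with the equal PDF and JSON counts reported in the statistics above.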

### Weekly Tasks
1. Clear old logs if they exceed 100 MB
2. Backup inspection data
3. Test article generation manually
4. Check for portal changes that might break the scraper

### Monthly Tasks
1. Review and optimize collection schedule
2. Analyze inspection trends
3. Update this knowledge base
4. Plan feature improvements

---

## 📞 Technical Details

### Dependencies
**Windows Collector:**
- Python 3.13+
- Selenium
- Requests
- Webdriver-manager
- Chrome browser

**Server Processing:**
- PHP 7.4+
- Python 3.9+
- PyPDF2
- Weaviate client
- Anthropic API (Claude)

### Performance Metrics
- **Collection Rate:** ~10 inspections per 20 minutes
- **Server Processing:** < 1 second per upload
- **Storage Growth:** ~500KB per inspection
- **Bandwidth Usage:** ~5MB per collection run

---

## 🎯 Success Metrics

### Current Achievements
✅ Automated collection running 6x daily
✅ 36 inspections collected and stored
✅ Smart incremental collection working
✅ API upload pipeline functional
✅ Article generation tested successfully

### Next Milestones
⏳ Schedule automated article generation
⏳ Launch SF section on CleanKitchens.org
⏳ Collect 1,000+ inspections
⏳ Generate 100+ articles
⏳ Achieve page 1 Google rankings for SF restaurant inspections

---

*This knowledge base is the single source of truth for the SF inspection system. Update it whenever significant changes are made to the collection, processing, or generation pipeline.*