# Filter Pipeline Documentation

## Overview

BalanceBoard's Filter Pipeline is a plugin-based content filtering system that provides intelligent categorization, moderation, and ranking of posts using AI-powered analysis, with aggressive caching for cost efficiency.

## Architecture

### Three-Level Caching System

1. **Level 1: Memory Cache** (5-minute TTL)
   - In-memory cache for fast repeated access
   - Cleared on application restart
2. **Level 2: AI Analysis Cache** (permanent, content-hash based)
   - Stores AI results (categorization, moderation, quality scores)
   - Keyed by SHA-256 hash of content
   - Never expires: the same content always returns cached results
   - **Huge cost savings**: the same content is never re-analyzed
3. **Level 3: Filterset Results Cache** (24-hour TTL)
   - Stores final filter results per filterset
   - Invalidated when the filterset definition changes
   - Enables instant filterset switching

### Pipeline Stages

Posts flow through four sequential stages:

```
Raw Post
   ↓
1. Categorizer (AI: topic detection, tags)
   ↓
2. Moderator (AI: safety, quality, sentiment)
   ↓
3. Filter (Rules: apply filterset conditions)
   ↓
4. Ranker (Score: quality + recency + source + engagement)
   ↓
Filtered & Scored Post
```
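To make the flow concrete, here is a minimal sketch of how four such stages could be chained. The function names and post fields are illustrative only, not BalanceBoard's actual API; the real stages use AI calls and caching rather than the hard-coded values shown here.

```python
# Illustrative sketch only: stage names and fields mimic the diagram,
# but this is not the actual BalanceBoard implementation.
def run_pipeline(post, stages):
    """Pass a post through each stage in order; a stage may reject it."""
    for stage in stages:
        post = stage(post)
        if post is None:  # the Filter stage can reject a post outright
            return None
    return post

def categorizer(post):  # Stage 1: AI-backed in the real pipeline
    post.setdefault("_filter_categories", ["technology"])
    return post

def moderator(post):  # Stage 2: AI-backed in the real pipeline
    post.setdefault("moderation", {"quality": 0.75})
    return post

def filter_stage(post):  # Stage 3: fast rule evaluation
    return post if post["moderation"]["quality"] >= 0.6 else None

def ranker(post):  # Stage 4: relevance scoring
    post["_filter_score"] = post["moderation"]["quality"]
    return post

result = run_pipeline({"title": "Example"},
                      [categorizer, moderator, filter_stage, ranker])
```

The key design point this mirrors is that a stage can drop a post entirely (Filter) or only annotate it (the other three), so later stages can rely on earlier annotations.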
#### Stage 1: Categorizer

- **Purpose**: Detect topics and assign categories
- **AI Model**: Llama 70B (cheap model)
- **Caching**: Permanent (by content hash)
- **Output**: Categories, category scores, tags

#### Stage 2: Moderator

- **Purpose**: Safety and quality analysis
- **AI Model**: Llama 70B (cheap model)
- **Caching**: Permanent (by content hash)
- **Metrics**:
  - Violence score (0.0-1.0)
  - Sexual content score (0.0-1.0)
  - Hate speech score (0.0-1.0)
  - Harassment score (0.0-1.0)
  - Quality score (0.0-1.0)
  - Sentiment (positive/neutral/negative)

#### Stage 3: Filter

- **Purpose**: Apply filterset rules
- **AI**: None (fast rule evaluation)
- **Rules Supported**:
  - `equals`, `not_equals`
  - `in`, `not_in`
  - `min`, `max`
  - `includes_any`, `excludes`

#### Stage 4: Ranker

- **Purpose**: Calculate relevance scores
- **Scoring Factors**:
  - Quality (30%): from the Moderator stage
  - Recency (25%): age-based decay
  - Source Tier (25%): platform reputation
  - Engagement (20%): upvotes + comments

## Configuration

### filter_config.json

```json
{
  "ai": {
    "enabled": false,
    "openrouter_key_file": "openrouter_key.txt",
    "models": {
      "cheap": "meta-llama/llama-3.3-70b-instruct",
      "smart": "meta-llama/llama-3.3-70b-instruct"
    },
    "parallel_workers": 10,
    "timeout_seconds": 60
  },
  "cache": {
    "enabled": true,
    "ai_cache_dir": "data/filter_cache",
    "filterset_cache_ttl_hours": 24
  },
  "pipeline": {
    "default_stages": ["categorizer", "moderator", "filter", "ranker"],
    "batch_size": 50,
    "enable_parallel": true
  }
}
```

### filtersets.json

Each filterset defines filtering rules:

```json
{
  "safe_content": {
    "description": "Filter for safe, family-friendly content",
    "post_rules": {
      "moderation.flags.is_safe": {"equals": true},
      "moderation.content_safety.violence": {"max": 0.3},
      "moderation.content_safety.sexual_content": {"max": 0.2},
      "moderation.content_safety.hate_speech": {"max": 0.1}
    },
    "comment_rules": {
      "moderation.flags.is_safe": {"equals": true}
    }
  }
}
```
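Dotted rule paths such as `moderation.content_safety.violence` can be pictured as simple dictionary traversals combined with the Stage 3 operators. A minimal sketch of how such rules might be evaluated; the helper names (`get_path`, `passes`) are hypothetical, not the real API, and the `includes_any`/`excludes` operators are omitted for brevity.

```python
# Hypothetical rule evaluator mirroring the Stage 3 operators;
# not the actual BalanceBoard implementation.
OPS = {
    "equals":     lambda value, arg: value == arg,
    "not_equals": lambda value, arg: value != arg,
    "in":         lambda value, arg: value in arg,
    "not_in":     lambda value, arg: value not in arg,
    "min":        lambda value, arg: value >= arg,
    "max":        lambda value, arg: value <= arg,
}

def get_path(post: dict, dotted: str):
    """Resolve a dotted path like 'moderation.flags.is_safe'."""
    value = post
    for key in dotted.split("."):
        value = value[key]
    return value

def passes(post: dict, rules: dict) -> bool:
    """A post passes only if every condition of every rule holds."""
    return all(
        OPS[op](get_path(post, path), arg)
        for path, conditions in rules.items()
        for op, arg in conditions.items()
    )

post = {"moderation": {"flags": {"is_safe": True},
                       "content_safety": {"violence": 0.1}}}
rules = {"moderation.flags.is_safe": {"equals": True},
         "moderation.content_safety.violence": {"max": 0.3}}
ok = passes(post, rules)  # True: safe flag set and violence below 0.3
```

Because the rules are plain data, this evaluation needs no AI, which is why the Filter stage is the fast one.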
## Usage

### User Perspective

1. Navigate to **Settings → Filters**
2. Select a filterset from the dropdown
3. Save preferences
4. The feed automatically applies your filterset
5. Posts are sorted by relevance score (highest first)

### Developer Perspective

```python
from filter_pipeline import FilterEngine

# Get singleton instance
engine = FilterEngine.get_instance()

# Apply filterset to posts
filtered_posts = engine.apply_filterset(
    posts=raw_posts,
    filterset_name='safe_content',
    use_cache=True
)

# Access filter metadata
for post in filtered_posts:
    score = post['_filter_score']            # 0.0-1.0
    categories = post['_filter_categories']  # ['technology', 'programming']
    tags = post['_filter_tags']              # ['reddit', 'python']
```

## AI Integration

### Enabling AI

1. **Get an OpenRouter API key**:
   - Sign up at https://openrouter.ai
   - Generate an API key
2. **Configure**:
   ```bash
   echo "your-api-key-here" > openrouter_key.txt
   ```
3. **Enable in config**:
   ```json
   {
     "ai": {
       "enabled": true
     }
   }
   ```
4. **Restart the application**
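Once AI is enabled, every analysis result lands in the permanent Level 2 cache, keyed by the SHA-256 hash of the content. A minimal sketch of that idea; the directory matches `ai_cache_dir` from `filter_config.json`, but the per-file JSON layout and function names here are assumptions, not the actual on-disk format.

```python
import hashlib
import json
from pathlib import Path

# Sketch of a permanent content-hash cache (Level 2); layout is assumed.
CACHE_DIR = Path("data/filter_cache")

def content_key(text: str) -> str:
    """Identical content always hashes to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cached_analyze(text: str, analyze) -> dict:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{content_key(text)}.json"
    if path.exists():          # cache hit: no AI call, zero cost
        return json.loads(path.read_text())
    result = analyze(text)     # cache miss: one paid AI call, ever
    path.write_text(json.dumps(result))
    return result
```

Because the key depends only on the content, a re-fetched post with identical text never triggers a second AI call, which is where the cost figures in this section come from.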
### Cost Efficiency

- **Model**: Llama 70B only (~$0.0003 / 1K tokens)
- **Caching**: permanent AI result cache
- **Estimate**: ~$0.001 per post (first time), $0 (cached)
- **10,000 posts**: ~$10 first time, ~$0 cached

## Performance

### Benchmarks

- **Cache hit**: < 10 ms per post
- **Cache miss (AI call)**: ~500 ms per post
- **Parallel processing**: 10 workers (configurable)
- **Typical feed load**: 100 posts in < 1 second (cached)

### Cache Hit Rates

After initial processing:

- **AI Cache**: ~95% hit rate (content rarely changes)
- **Filterset Cache**: ~80% hit rate (depends on TTL)
- **Memory Cache**: ~60% hit rate (5-minute TTL)

## Monitoring

### Cache Statistics

```python
stats = filter_engine.get_cache_stats()
# {
#     'memory_cache_size': 150,
#     'ai_cache_size': 5000,
#     'filterset_cache_size': 8,
#     'ai_cache_dir': '/app/data/filter_cache',
#     'filterset_cache_dir': '/app/data/filter_cache/filtersets'
# }
```

### Logs

The filter pipeline logs to `app.log`:

```
INFO - FilterEngine initialized with 5 filtersets
DEBUG - Categorizer: Cache hit for a3f5c2e8...
DEBUG - Moderator: Analyzed b7d9e1f3... (quality: 0.75)
DEBUG - Filter: Post passed filterset 'safe_content'
DEBUG - Ranker: Post score=0.82 (q:0.75, r:0.90, s:0.70, e:0.85)
```
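The Ranker log line combines the four factors with the weights documented for Stage 4 (Quality 30%, Recency 25%, Source Tier 25%, Engagement 20%). A sketch of that weighted sum follows; the exact formula in the shipped code is an assumption based on this documentation, and indeed the plain weighted sum of the logged factors is ~0.80, so the logged 0.82 suggests the implementation applies some additional adjustment.

```python
# Documented Ranker weights; how the shipped code combines them
# is an assumption based on this documentation.
WEIGHTS = {"quality": 0.30, "recency": 0.25, "source": 0.25, "engagement": 0.20}

def rank_score(factors: dict) -> float:
    """Weighted sum of the four scoring factors, each in 0.0-1.0."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

# Factors from the example log line (q:0.75, r:0.90, s:0.70, e:0.85)
score = rank_score({"quality": 0.75, "recency": 0.90,
                    "source": 0.70, "engagement": 0.85})  # ~0.80
```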
## Filtersets

### no_filter

- **Description**: No filtering; all content passes
- **Use Case**: Default, unfiltered feed
- **Rules**: None
- **AI**: Disabled

### safe_content

- **Description**: Family-friendly content only
- **Use Case**: Safe browsing
- **Rules**:
  - Violence < 0.3
  - Sexual content < 0.2
  - Hate speech < 0.1
- **AI**: Required

### tech_only

- **Description**: Technology and programming content
- **Use Case**: Tech professionals
- **Rules**:
  - Platform: hackernews, reddit, lobsters, stackoverflow
  - Topics: technology, programming, software (confidence > 0.5)
- **AI**: Required

### high_quality

- **Description**: High-quality posts only
- **Use Case**: Curated feed
- **Rules**:
  - Score ≥ 10
  - Quality ≥ 0.6
  - Readability grade ≤ 14
- **AI**: Required

## Plugin System

### Creating Custom Plugins

```python
from filter_pipeline.plugins import BaseFilterPlugin

class MyCustomPlugin(BaseFilterPlugin):
    def get_name(self) -> str:
        return "MyCustomFilter"

    def should_filter(self, post: dict, context: dict = None) -> bool:
        # Return True to filter OUT (reject) the post
        title = post.get('title', '').lower()
        return 'spam' in title

    def score(self, post: dict, context: dict = None) -> float:
        # Return a score in 0.0-1.0
        return 0.5
```

### Built-in Plugins

- **KeywordFilterPlugin**: Blocklist/allowlist filtering
- **QualityFilterPlugin**: Length, caps, and clickbait detection

## Troubleshooting

### Issue: AI Not Working

**Check**:

1. `filter_config.json` has `"enabled": true`
2. The OpenRouter API key file exists
3. Logs for API errors

**Solution**:

```bash
# Test the API key
curl -H "Authorization: Bearer $(cat openrouter_key.txt)" \
  https://openrouter.ai/api/v1/models
```

### Issue: Posts Not Filtered

**Check**:

1. The user has a filterset selected in settings
2. The filterset exists in `filtersets.json`
3. Posts match the filter rules
**Solution**:

```python
import json

# Check user settings
user_settings = json.loads(current_user.settings)
print(user_settings.get('filter_set'))  # Should not be 'no_filter'
```

### Issue: Slow Performance

**Check**:

1. Cache enabled in config
2. Cache hit rates
3. Parallel processing enabled

**Solution**:

```json
{
  "cache": {"enabled": true},
  "pipeline": {"enable_parallel": true, "parallel_workers": 10}
}
```

## Future Enhancements

- [ ] Database persistence for FilterResults
- [ ] Filter statistics dashboard
- [ ] Custom user-defined filtersets
- [ ] A/B testing of different filter configurations
- [ ] Real-time filter updates without restart
- [ ] Multi-language support
- [ ] Advanced ML models for categorization

## Contributing

When adding new filtersets:

1. Define the filterset in `filtersets.json`
2. Test it with sample posts
3. Document its rules and use case
4. Consider its AI requirements

When adding new stages:

1. Extend `BaseStage`
2. Implement the `process()` method
3. Use caching where appropriate
4. Add the stage to `pipeline_config.json`

## License

AGPL-3.0 with a commercial licensing option (see the LICENSE file)