Add filter pipeline core infrastructure (Phase 1)

Implements a plugin-based content filtering system with multi-level caching:

Core Components:
- FilterEngine: Main orchestrator for content filtering
- FilterCache: 3-level caching (memory, AI results, filterset results)
- FilterConfig: Configuration loader for filter_config.json & filtersets.json
- FilterResult & AIAnalysisResult: Data models for filter results (sketched below)
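
The data models are plain value objects. A minimal sketch of their shape (all field names here are illustrative assumptions, not the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class AIAnalysisResult:
    # Hypothetical fields; the real model may differ.
    category: str
    moderation_flags: list[str] = field(default_factory=list)
    relevance_score: float = 0.0
    model: str = ""

@dataclass
class FilterResult:
    post_id: str
    passed: bool
    rejected_by: str | None = None  # name of the stage that rejected the post
    ai: AIAnalysisResult | None = None
```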

Architecture:
- BaseStage: Abstract class for pipeline stages
- BaseFilterPlugin: Abstract class for filter plugins
- Multi-threaded parallel processing support
- Content-hash-based AI result caching, so identical content is analyzed only once (cost savings); see the sketch after this list
- Filterset result caching (fast filterset switching)
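
A sketch of the two abstract base classes and the content-hash cache key. Method names and signatures are assumptions; only the class names come from this commit:

```python
import hashlib
from abc import ABC, abstractmethod

class BaseStage(ABC):
    """One pipeline step (e.g. categorizer, moderator, filter, ranker)."""

    @abstractmethod
    def process(self, posts: list[dict]) -> list[dict]:
        """Transform or annotate a batch of posts."""

class BaseFilterPlugin(ABC):
    """Decides whether a single post passes a filterset rule."""

    @abstractmethod
    def matches(self, post: dict) -> bool: ...

def ai_cache_key(content: str, model: str) -> str:
    # Identical content analyzed by the same model maps to the same key,
    # so each post is sent to the API at most once.
    return hashlib.sha256(f"{model}:{content}".encode()).hexdigest()
```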

Configuration:
- filter_config.json: AI models, caching, parallel workers (a loader sketch follows this list)
- Both the "cheap" and "smart" model slots point to Llama 3.3 70B for cost efficiency
- Compatible with existing filtersets.json
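
A sketch of how FilterConfig might read the file (the keys match filter_config.json as shown in the diff below; the accessor API itself is an assumption):

```python
import json
from pathlib import Path

class FilterConfig:
    """Illustrative loader; the real class may expose different accessors."""

    def __init__(self, path: str = "filter_config.json"):
        self._cfg = json.loads(Path(path).read_text())

    @property
    def ai_enabled(self) -> bool:
        return self._cfg.get("ai", {}).get("enabled", False)

    @property
    def parallel_workers(self) -> int:
        return self._cfg.get("ai", {}).get("parallel_workers", 10)

    @property
    def default_stages(self) -> list[str]:
        return self._cfg.get("pipeline", {}).get("default_stages", [])
```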

Integration:
- apply_filterset() API compatible with user preferences
- process_batch() for batch post processing (usage sketched below)
- Lazy-loaded stages to avoid import errors when AI is disabled
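
A hedged usage sketch of the two entry points; the module path, constructor argument, sample data, and filterset name are all assumptions:

```python
# Hypothetical import path and signatures, shown for illustration only.
from filter_engine import FilterEngine

engine = FilterEngine(config_path="filter_config.json")
posts = [{"id": "1", "content": "example post"}]  # placeholder data

# Run a batch of posts through the configured pipeline stages.
results = engine.process_batch(posts)

# Re-filter against a named filterset from filtersets.json; cached
# filterset results make switching between filtersets fast.
filtered = engine.apply_filterset("example_filterset", posts)
```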

Related to issue #8 (filtering engine implementation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-11 22:46:10 -05:00
parent 07df6d8f0a
commit 94e12041ec
10 changed files with 1143 additions and 0 deletions

filter_config.json (new file, 27 lines added)

@@ -0,0 +1,27 @@
{
"ai": {
"enabled": false,
"openrouter_key_file": "openrouter_key.txt",
"models": {
"cheap": "meta-llama/llama-3.3-70b-instruct",
"smart": "meta-llama/llama-3.3-70b-instruct"
},
"parallel_workers": 10,
"timeout_seconds": 60,
"note": "Using only Llama 70B for cost efficiency"
},
"cache": {
"enabled": true,
"ai_cache_dir": "data/filter_cache",
"filterset_cache_ttl_hours": 24
},
"pipeline": {
"default_stages": ["categorizer", "moderator", "filter", "ranker"],
"batch_size": 50,
"enable_parallel": true
},
"output": {
"filtered_dir": "data/filtered",
"save_rejected": false
}
}