duplicate-detector
Research Agent
What it does
The duplicate detector finds semantic duplicates — functions that serve the same purpose with different names or implementations. It extracts a function catalog, categorizes by domain, and uses Opus to identify consolidation opportunities with confidence ratings.
Why it exists
Codebases accumulate duplicates organically. Three different formatDate functions, five string sanitizers, two path normalizers. This agent finds them so you can consolidate before the codebase becomes unmaintainable.
Source document
Duplicate Function Detector
Detect semantic code duplication - functions that do the same thing but have different names or implementations. Especially useful for codebases where new functions are often created rather than reusing existing ones.
When to Use
- Codebase has grown organically with multiple contributors
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring to identify consolidation opportunities
- After syntactic duplicate detection (jscpd) has been handled
Process
1. Extract function catalog → catalog.json
2. Categorize by domain → categorized.json (haiku)
3. Split into categories → categories/*.json
4. Find duplicates per category → duplicates/*.json (opus)
5. Generate report → duplicates-report.md
6. Human review of the report
Phase 1: Extract Function Catalog
./agents/research/duplicate-detector/extract-functions.sh src/ -o catalog.json
Options:
- -o FILE: Output file (default: stdout)
- -c N: Lines of context to capture (default: 15)
- -t GLOB: File types (default: *.ts,*.tsx,*.js,*.jsx)
- --include-tests: Include test files (excluded by default)
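The extraction script itself is not shown here. As a rough illustration of what a catalog entry might contain, the Python sketch below pulls function declarations out of JavaScript/TypeScript source text with a regular expression and keeps a fixed number of context lines. The `file`, `name`, and `line` fields mirror the output format used in Phase 2; the `context` field, the `extract_functions` helper, and the regex are assumptions made for illustration and are likely cruder than what extract-functions.sh actually does.

```python
import json
import re

# Matches `function foo(` and `const foo = (` / `const foo = async (` declarations.
# Illustrative pattern only; not taken from extract-functions.sh.
FUNC_RE = re.compile(
    r"^\s*(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*\("
    r"|^\s*(?:export\s+)?const\s+(\w+)\s*=\s*(?:async\s*)?\("
)

def extract_functions(path, source, context_lines=15):
    """Return one catalog entry per function declaration found in `source`."""
    lines = source.splitlines()
    catalog = []
    for index, text in enumerate(lines):
        match = FUNC_RE.match(text)
        if not match:
            continue
        name = match.group(1) or match.group(2)
        catalog.append({
            "file": path,
            "name": name,
            "line": index + 1,  # 1-based line number
            "context": "\n".join(lines[index:index + context_lines]),
        })
    return catalog

# Hypothetical input file with two functions that do the same thing.
sample = """export function formatDate(d) {
  return d.toISOString();
}

const toIsoDate = (d) => d.toISOString();
"""
catalog = extract_functions("src/dates.ts", sample, context_lines=3)
print(json.dumps(catalog, indent=2))
```

The context lines matter because later phases judge purpose from the captured snippet rather than from the function name alone.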
Phase 2: Categorize by Domain
Dispatch a haiku subagent:
Read the function catalog at catalog.json and categorize each function.
Assign each function to exactly ONE category based on its primary purpose.
## Categories
- **file-ops**: Reading, writing, path manipulation, directory operations
- **string-utils**: Formatting, parsing, sanitization, case conversion, truncation
- **validation**: Input checking, schema validation, type guards, assertions
- **error-handling**: Error creation, wrapping, formatting, logging helpers
- **http-api**: Request building, response parsing, URL construction, headers
- **date-time**: Date formatting, parsing, comparison, timezone handling
- **data-transform**: Mapping, filtering, normalization, serialization
- **database**: Query building, connection management, migrations
- **logging**: Log formatting, debug helpers, telemetry
- **config**: Configuration loading, environment variables, settings
- **async-utils**: Promise helpers, retry logic, debounce, throttle
- **testing**: Test utilities, mocks, fixtures, assertions
- **ui-helpers**: DOM manipulation, event handling, component utilities
- **crypto**: Hashing, encryption, token generation
- **other**: Doesn't fit above categories (note subcategory in purpose)
## Output Format
For each function, output:
{"file": "...", "name": "...", "line": N, "category": "...", "purpose": "one sentence"}
Use the Write tool to save the complete JSON array to categorized.json.
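Because the categorization is produced by a model, it can be worth checking categorized.json before moving on. The following is a minimal sketch of such a check, assuming the file holds a JSON array of objects in the format above; the `validate_entries` helper is hypothetical and not part of the agent's scripts.

```python
import json

ALLOWED_CATEGORIES = {
    "file-ops", "string-utils", "validation", "error-handling", "http-api",
    "date-time", "data-transform", "database", "logging", "config",
    "async-utils", "testing", "ui-helpers", "crypto", "other",
}
REQUIRED_FIELDS = ("file", "name", "line", "category", "purpose")

def validate_entries(entries):
    """Return a list of problems; an empty list means the data looks valid."""
    problems = []
    seen = set()
    for i, entry in enumerate(entries):
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            problems.append(f"entry {i}: missing fields {missing}")
            continue
        if entry["category"] not in ALLOWED_CATEGORIES:
            problems.append(f"entry {i}: unknown category {entry['category']!r}")
        # Each function should be assigned to exactly one category.
        key = (entry["file"], entry["name"], entry["line"])
        if key in seen:
            problems.append(f"entry {i}: function listed more than once")
        seen.add(key)
    return problems

# Hypothetical model output: the second entry uses a category that is not in the list.
entries = json.loads("""[
  {"file": "src/a.ts", "name": "slugify", "line": 3,
   "category": "string-utils", "purpose": "Convert text to a URL slug."},
  {"file": "src/b.ts", "name": "fmt", "line": 9,
   "category": "dates", "purpose": "Format a date."}
]""")
problems = validate_entries(entries)
print(problems)
```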
Phase 3: Split into Categories
./agents/research/duplicate-detector/prepare-category-analysis.sh categorized.json ./categories
Creates one JSON file per category. Only analyze categories with 3+ functions.
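The split and the 3+ threshold can be pictured with the short Python sketch below. It is not the contents of prepare-category-analysis.sh; the helper names and the in-memory grouping (rather than writing files) are assumptions chosen to keep the example small.

```python
from collections import defaultdict

MIN_FUNCTIONS = 3  # categories with fewer functions are not analyzed

def split_by_category(entries):
    """Group categorized entries by their `category` field."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["category"]].append(entry)
    return dict(groups)

def categories_to_analyze(groups, minimum=MIN_FUNCTIONS):
    """Return the names of categories large enough to be worth comparing."""
    return sorted(name for name, items in groups.items() if len(items) >= minimum)

# Hypothetical categorized entries (only the fields needed here).
entries = [
    {"name": "formatDate", "category": "date-time"},
    {"name": "toIsoDate", "category": "date-time"},
    {"name": "dateToString", "category": "date-time"},
    {"name": "sha256", "category": "crypto"},
]
groups = split_by_category(entries)
selected = categories_to_analyze(groups)
print(selected)
```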
Phase 4: Find Duplicates (Per Category)
For each category with 3+ functions, dispatch an opus subagent:
You are analyzing functions in the "{CATEGORY}" category for semantic duplicates.
Semantic duplicates are functions that serve the SAME PURPOSE even if:
- They have different names
- They use different implementations
- They have slightly different signatures
- One is more general than another
## Your Task
1. Compare all functions in this category
2. Identify groups of functions that do the same thing
3. For each duplicate group, assess confidence and recommend action
## Output Format
Return a JSON array of duplicate groups:
[
{
"intent": "<what these functions all do>",
"confidence": "HIGH|MEDIUM|LOW",
"functions": [
{"file": "<path>", "name": "<name>", "line": <N>, "notes": "<specifics>"}
],
"differences": "<how implementations differ>",
"recommendation": {
"action": "CONSOLIDATE|INVESTIGATE|KEEP_SEPARATE",
"survivor": "<which function to keep>",
"reason": "<why>"
}
}
]
## Confidence Levels
- **HIGH**: Definitely the same. Same input→output semantics.
- **MEDIUM**: Likely the same with minor differences.
- **LOW**: Possibly related, worth investigating.
## Recommendations
- **CONSOLIDATE**: Keep the one with better name/implementation/tests.
- **INVESTIGATE**: Need full implementations to decide. Flag for human review.
- **KEEP_SEPARATE**: Functions look similar but serve distinct purposes.
## Functions in "{CATEGORY}" Category
<INSERT_CATEGORY_JSON_HERE>
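One way to fill the two placeholders in this prompt and to sanity-check the reply is sketched below in Python. The `fill_template` and `parse_groups` helpers are hypothetical, and how the subagent is actually dispatched is outside the sketch; it only covers the text substitution and the enum values named in the output format.

```python
import json

CONFIDENCE_LEVELS = {"HIGH", "MEDIUM", "LOW"}
ACTIONS = {"CONSOLIDATE", "INVESTIGATE", "KEEP_SEPARATE"}

def fill_template(template, category, functions):
    """Substitute the {CATEGORY} and <INSERT_CATEGORY_JSON_HERE> placeholders."""
    # str.replace is used instead of str.format because the prompt
    # contains literal JSON braces that str.format would try to interpret.
    return (template
            .replace("{CATEGORY}", category)
            .replace("<INSERT_CATEGORY_JSON_HERE>", json.dumps(functions, indent=2)))

def parse_groups(response_text):
    """Parse the subagent's JSON reply and reject unexpected enum values."""
    groups = json.loads(response_text)
    for group in groups:
        if group["confidence"] not in CONFIDENCE_LEVELS:
            raise ValueError(f"unexpected confidence: {group['confidence']!r}")
        if group["recommendation"]["action"] not in ACTIONS:
            raise ValueError(f"unexpected action: {group['recommendation']['action']!r}")
    return groups

# Shortened stand-in for the full prompt text.
template = 'You are analyzing functions in the "{CATEGORY}" category.\n<INSERT_CATEGORY_JSON_HERE>'
functions = [{"file": "src/dates.ts", "name": "formatDate", "line": 1}]
prompt = fill_template(template, "date-time", functions)

# Hypothetical reply in the documented output format.
reply = """[{"intent": "Format a date as ISO 8601", "confidence": "HIGH",
  "functions": [], "differences": "none",
  "recommendation": {"action": "CONSOLIDATE", "survivor": "formatDate", "reason": "has tests"}}]"""
groups = parse_groups(reply)
print(prompt)
```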
Phase 5: Generate Report
./agents/research/duplicate-detector/generate-report.sh ./duplicates ./duplicates-report.md
Produces a prioritized markdown report grouped by confidence level.
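The grouping by confidence level could look roughly like the Python sketch below. The exact layout produced by generate-report.sh is not documented here, so the headings and bullet wording are assumptions; only the HIGH, MEDIUM, LOW ordering and the input fields come from the Phase 4 output format.

```python
CONFIDENCE_ORDER = ["HIGH", "MEDIUM", "LOW"]

def render_report(groups):
    """Render duplicate groups as markdown, highest confidence first."""
    lines = ["# Duplicate Function Report", ""]
    for level in CONFIDENCE_ORDER:
        matching = [g for g in groups if g["confidence"] == level]
        if not matching:
            continue  # omit empty sections
        lines.append(f"## {level} confidence ({len(matching)})")
        lines.append("")
        for group in matching:
            rec = group["recommendation"]
            lines.append(f"### {group['intent']}")
            lines.append(f"- Action: {rec['action']} (keep `{rec['survivor']}`)")
            for fn in group["functions"]:
                lines.append(f"- `{fn['name']}` in {fn['file']}:{fn['line']}")
            lines.append("")
    return "\n".join(lines)

# Hypothetical duplicate groups, deliberately listed with LOW first.
groups = [
    {"intent": "Truncate a string", "confidence": "LOW",
     "functions": [{"file": "src/a.ts", "name": "truncate", "line": 4},
                   {"file": "src/b.ts", "name": "shorten", "line": 10}],
     "recommendation": {"action": "INVESTIGATE", "survivor": "truncate", "reason": "unclear"}},
    {"intent": "Format a date as ISO 8601", "confidence": "HIGH",
     "functions": [{"file": "src/dates.ts", "name": "formatDate", "line": 1},
                   {"file": "src/time.ts", "name": "toIsoDate", "line": 7}],
     "recommendation": {"action": "CONSOLIDATE", "survivor": "formatDate", "reason": "has tests"}},
]
report = render_report(groups)
print(report)
```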
Phase 6: Human Review
Review the report. For HIGH confidence duplicates:
- Verify the recommended survivor has tests
- Update callers to use the survivor
- Delete the duplicates
- Run tests
High-Risk Duplicate Zones
Focus extraction on these areas first:
| Zone | Common Duplicates |
|---|---|
| utils/, helpers/, lib/ | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
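For the directory-based row of this table, a small Python sketch of how files might be ordered so those zones are extracted first is shown below. The helper names are hypothetical, and the other rows (validation, error formatting, and so on) describe kinds of code rather than locations, so they are not covered by a path check like this one.

```python
from pathlib import PurePosixPath

HIGH_RISK_DIRS = {"utils", "helpers", "lib"}

def in_high_risk_zone(file_path):
    """True when any directory component of the path is a high-risk directory."""
    return any(part in HIGH_RISK_DIRS for part in PurePosixPath(file_path).parent.parts)

def prioritize(paths):
    """Put high-risk paths first; sorted() is stable, so relative order is otherwise kept."""
    return sorted(paths, key=lambda p: not in_high_risk_zone(p))

# Hypothetical project files.
paths = [
    "src/app/main.ts",
    "src/utils/date.ts",
    "src/lib/paths.ts",
    "src/components/Button.tsx",
]
ordered = prioritize(paths)
print(ordered)
```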
Common Mistakes
- Extracting too much: Focus on exported functions and public methods
- Skipping categorization: Going straight to duplicate detection produces noise
- Using haiku for detection: Use Opus for the actual duplicate analysis
- Consolidating without tests: Ensure the survivor has tests first