Director of Technology
Subin Joshua
Building data infrastructure and AI systems that make GTM teams measurably more effective.
I architect production-grade pipelines, deploy AI applications, and automate everything that shouldn't be manual. Currently leading technical operations at ThinkFISH.
India • Remote • subinjoshua.316@gmail.com
By the Numbers
22,348
Contacts in data warehouse
10,651
Sales events processed
140
LinkedIn accounts managed
4.3%
Conversion improvement
Capabilities
Systems that scale
DATA INFRASTRUCTURE
From spreadsheets to real-time analytics
PostgreSQL warehouses with dimensional modeling. Complex SQL with CTEs, window functions, and triggers. Metabase dashboards tracking full customer journeys.
AI APPLICATIONS
LLMs that ship to production
Document analysis with Claude and GPT APIs. Chatbots for support automation. Intelligent scrapers for lead enrichment. LangChain pipelines for reasoning.
AUTOMATION SYSTEMS
Eliminate repetitive work
Self-hosted n8n on Coolify/Hetzner. Webhook orchestration with deduplication. Multi-system sync across CRM, calendar, and databases.
Case Studies
Systems I've built
WEB DEVELOPMENT
ThinkFISH Websites
GTM PLATFORM
ThinkFISH Data Warehouse
Centralized PostgreSQL warehouse replacing Google Sheets chaos. Real-time pipeline tracking for 24 sales reps across 73 stages.
22K+
Contacts
10K+
Events
4.3%
Conversion lift
WEBHOOK INFRASTRUCTURE
Event Processing Pipeline
Why, Not Just What
Decisions, not just implementations
Every tool choice has tradeoffs. Here's why I picked what I picked.
Why self-hosted n8n
At 10K+ events/month, Zapier's $0.01/task adds up to $1,200/year. Self-hosted n8n on a $20/month Hetzner VPS = unlimited executions, full retry control, no vendor lock-in.
Why PostgreSQL for GTM data
Sales data is relational by nature: contacts → events → opportunities. Triggers compute meeting_sequence at write time, not query time. CTEs keep complex pipeline reports readable.
Why GoHighLevel over HubSpot
For 140 LinkedIn accounts, GHL's location-based pricing beats HubSpot's per-seat model by 10x. API is rougher, but that's what n8n is for.
Production Lessons
Bugs I've shipped and fixed
Every production system has war stories. These taught me the most.
The comma that broke everything
A contact's name with a comma crashed our entire Calendly pipeline. The comma was splitting SQL parameters mid-query.
Fix
Moved from string interpolation to proper parameterized queries (sketched below).
Lesson
Never trust user input delimiters.
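A minimal sketch of the fix, shown here as a server-side prepared statement; in the n8n/application layer the equivalent is the driver's parameter placeholders. The contacts table and full_name column are illustrative, not the production schema.
PostgreSQL
-- Values are bound as parameters, so a comma in the name is just data,
-- never part of the SQL text being parsed.
PREPARE insert_contact (text, text) AS
INSERT INTO contacts (full_name, source) VALUES ($1, $2);
EXECUTE insert_contact ('Doe, Jane', 'calendly');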
The duplicate webhook storm
Without deduplication, one meeting generated three database entries. Sales team was seeing triple the actual bookings.
Fix
Built a webhook_receipts table with an event_id unique constraint (sketched below).
Lesson
Always assume external APIs will spam you.
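A minimal sketch of that receipt-table pattern, assuming the provider sends a stable event ID (the sample value is illustrative):
PostgreSQL
-- One receipt per provider event ID.
CREATE TABLE webhook_receipts (
event_id TEXT PRIMARY KEY,
received_at TIMESTAMPTZ DEFAULT NOW()
);
-- Record the receipt first; if the row already exists this returns no rows,
-- and the workflow stops before writing a duplicate booking.
INSERT INTO webhook_receipts (event_id)
VALUES ('calendly-evt-123')
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;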
The demo-day timeout
Dashboard died during an investor presentation. Each card ran a separate unoptimized query against a cold database.
Fix
Materialized views refreshed hourly + query rewrites (sketched below).
Lesson
Demo day will find your performance bugs.
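A sketch of that pattern for one dashboard card, assuming a per-stage deal count; the view name and 'open' status value are illustrative:
PostgreSQL
-- Pre-compute the aggregation the card needs instead of scanning deals on every load.
CREATE MATERIALIZED VIEW dashboard_stage_counts AS
SELECT stage_id, COUNT(*) AS open_deals
FROM deals
WHERE status = 'open'
GROUP BY stage_id;
-- Unique index so the view can refresh without blocking dashboard reads.
CREATE UNIQUE INDEX ON dashboard_stage_counts (stage_id);
-- Run hourly from a scheduler; Metabase reads the view, not the base tables.
REFRESH MATERIALIZED VIEW CONCURRENTLY dashboard_stage_counts;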
Data Modeling
How I structured ThinkFISH's warehouse
Relational design with computed fields at write time. One write saves millions of reads.
Entity Relationships
PostgreSQL
-- Core entities (simplified)
customers (
customer_id UUID PRIMARY KEY,
email TEXT UNIQUE,
source TEXT,
member_id UUID REFERENCES team_members
)
│
│ 1:many
▼
booking_events (
event_uuid UUID PRIMARY KEY,
customer_id UUID REFERENCES customers,
meeting_time TIMESTAMPTZ,
touch_type TEXT,
meeting_sequence INT -- computed via trigger
)
│
│ 1:many
▼
deals (
deal_id UUID PRIMARY KEY,
customer_id UUID REFERENCES customers,
pipeline_id UUID,
stage_id UUID,
status TEXT
)
│
│ 1:many
▼
deal_stage_history (
history_id UUID PRIMARY KEY,
deal_id UUID REFERENCES deals,
prev_stage UUID,
new_stage UUID,
changed_at TIMESTAMPTZ DEFAULT NOW()
)
Write-Time Computation
PL/pgSQL
-- Compute meeting_sequence at INSERT, not query time
CREATE OR REPLACE FUNCTION set_meeting_sequence()
RETURNS TRIGGER AS $$
BEGIN
NEW.meeting_sequence := (
SELECT COUNT(*) + 1
FROM booking_events
WHERE customer_id = NEW.customer_id
AND meeting_time < NEW.meeting_time
AND status = 'active'
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_meeting_sequence
BEFORE INSERT ON booking_events
FOR EACH ROW
EXECUTE FUNCTION set_meeting_sequence();
Key insight: meeting_sequence is computed via trigger at INSERT, not calculated at query time.
One write operation vs. millions of aggregation reads. The math is obvious.
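For contrast, a rough sketch of what the query-time alternative would have to do on every read, recomputing the sequence with a window function instead of reading the stored column:
PostgreSQL
-- Read-time version the trigger makes unnecessary: recompute every sequence
-- across the whole table each time someone asks.
SELECT
customer_id,
meeting_time,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY meeting_time
) AS meeting_sequence
FROM booking_events
WHERE status = 'active';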
Code Evolution
From spreadsheet hell to SQL clarity
The same question, answered two different ways. One of them scales.
Before
"Which leads came from Tyler's
Calendly link last week?"
→ Open Google Sheets
→ Find Alex Johnson's tab
→ Scroll to last week's dates
→ Count manually (miss some)
→ Cross-reference with other sheets
→ VLOOKUP fails on duplicates
→ Give up, estimate
Time: 15 minutes
Accuracy: ~70%
Confidence: Low
After
SELECT
COUNT(*) as total_leads,
AVG(meeting_sequence) as avg_touch
FROM booking_events be
JOIN team_members tm
ON be.member_id = tm.member_id
WHERE tm.name = 'Alex Johnson'
AND be.meeting_time > NOW() - INTERVAL '7 days';
-- Result: 23 leads, 1.4 avg touches
-- Time: 0.3 seconds
-- Accuracy: 100%
15min
→ 0.3 seconds
Time to answer
~70%
→ 100%
Accuracy
1
→ Unlimited
Concurrent users
Query I'm Proud Of
Pipeline velocity in one query
This tells leadership exactly where deals stall. No BI tool clicks, no waiting — just SQL.
Query
-- Pipeline velocity report: where do deals stall?
WITH stage_times AS (
SELECT
d.deal_id,
d.name,
h.prev_stage AS stage_id, -- the stage the deal just left
LAG(h.changed_at) OVER (
PARTITION BY d.deal_id
ORDER BY h.changed_at
) AS entered_stage,
h.changed_at AS left_stage
FROM deals d
JOIN deal_stage_history h
ON d.deal_id = h.deal_id
WHERE d.created_at > NOW() - INTERVAL '90 days'
)
SELECT
s.stage_name,
ROUND(
AVG(
EXTRACT(EPOCH FROM (left_stage - entered_stage)) / 86400
)::numeric, 1
) AS avg_days_in_stage,
COUNT(DISTINCT st.deal_id) AS deals,
ROUND(
COUNT(DISTINCT st.deal_id) * 100.0 /
SUM(COUNT(DISTINCT st.deal_id)) OVER ()
, 1) AS pct_of_pipeline
FROM stage_times st
JOIN pipeline_stages s ON st.stage_id = s.stage_id
WHERE st.entered_stage IS NOT NULL
GROUP BY s.stage_name, s.stage_order
ORDER BY s.stage_order;
Sample Output
stage_name      | avg_days | deals | pct
─────────────────────────────────────────
Discovery       |      1.2 |   156 | 34.2%
Qualification   |      2.8 |   142 | 31.1%
Proposal        |      3.4 |    89 | 19.5%
Negotiation     |      4.2 |    52 | 11.4%
Closed Won      |      0.5 |    17 |  3.7%
Techniques Used
Key Insight
Deals spend 4.2 days in Negotiation — that's where we lose them. Now sales knows where to focus.
Beliefs
Things I've learned the hard way
“Spreadsheets are where data pipelines go to die”
If your sales team lives in Google Sheets, you don't have a CRM — you have a shared anxiety document.
“The best automation is invisible to users”
If users notice your automation, it's probably broken. Success = "wait, that happens automatically?"
“Every external webhook lies”
They'll send duplicates, out-of-order events, and malformed payloads. Build like you expect them to.
“AI is a tool, not a strategy”
I use Claude daily for code review, data analysis, drafting. But system design? That's still human judgment.
Stack
Tools I work with
DATA
PostgreSQL • Metabase • SQL
AI / LLM
OpenAI • Claude • LangChain • Python
AUTOMATION
n8n • Zapier • Make.com • Webhooks
GTM
HubSpot • GoHighLevel • Calendly • Slack
ENRICHMENT
Clay • Apollo • Webhound.ai
INFRASTRUCTURE
Coolify • Hetzner • REST APIs • GitHub
Proof of Work
See it live
Links
Patterns I Use
Reusable solutions I've refined across projects:
Webhook Deduplication
Receipt table pattern for idempotent processing
Retry with Backoff
Exponential backoff for failed webhook delivery (sketched below)
Event Sourcing
Stage history tracking for audit trails
Materialized Views
Pre-computed aggregations for dashboards
Coming soon: n8n-webhook-patterns repo with deduplication, retry logic, and error handling examples.
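Until then, a rough sketch of the retry-with-backoff pattern as a database-backed queue; the table and column names are illustrative, not from the production system:
PostgreSQL
-- Failed deliveries wait out an exponentially growing delay before retrying.
CREATE TABLE webhook_retry_queue (
delivery_id UUID PRIMARY KEY,
payload JSONB NOT NULL,
attempt_count INT NOT NULL DEFAULT 0,
next_attempt TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- On failure: bump the count and push the next attempt out 1, 2, 4, 8... minutes.
UPDATE webhook_retry_queue
SET attempt_count = attempt_count + 1,
next_attempt = NOW() + (INTERVAL '1 minute' * POWER(2, attempt_count))
WHERE delivery_id = '00000000-0000-0000-0000-000000000000';
-- A scheduled worker (an n8n cron in practice) picks up whatever is due.
SELECT delivery_id, payload
FROM webhook_retry_queue
WHERE next_attempt <= NOW()
ORDER BY next_attempt
LIMIT 50;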
Experience
Career timeline
2016—B.Tech Information Technology — NIT Karnataka
Get in Touch