Director of Technology

Subin Joshua

Building data infrastructure and AI systems that make GTM teams measurably more effective.

I architect production-grade pipelines, deploy AI applications, and automate everything that shouldn't be manual. Currently leading technical operations at ThinkFISH.

Download resume

India • Remote • subinjoshua.316@gmail.com

By the Numbers

22,348

Contacts in data warehouse

10,651

Sales events processed

140

LinkedIn accounts managed

4.3%

Conversion improvement

Capabilities

Systems that scale

DATA INFRASTRUCTURE

From spreadsheets to real-time analytics

PostgreSQL warehouses with dimensional modeling. Complex SQL with CTEs, window functions, and triggers. Metabase dashboards tracking full customer journeys.

AI APPLICATIONS

LLMs that ship to production

Document analysis with Claude and GPT APIs. Chatbots for support automation. Intelligent scrapers for lead enrichment. LangChain pipelines for reasoning.

AUTOMATION SYSTEMS

Eliminate repetitive work

Self-hosted n8n on Coolify/Hetzner. Webhook orchestration with deduplication. Multi-system sync across CRM, calendar, and databases.

Case Studies

Systems I've built

GTM PLATFORM

ThinkFISH Data Warehouse

Centralized PostgreSQL warehouse replacing Google Sheets chaos. Real-time pipeline tracking for 24 sales reps across 73 stages.

PostgreSQL · n8n · Metabase

22K+

Contacts

10K+

Events

4.3%

Conversion lift

Why, Not Just What

Decisions, not just implementations

Every tool choice has tradeoffs. Here's why I picked what I picked.

$1,200/yr → $240/yr

Why self-hosted n8n

At 10K+ events/month, Zapier's $0.01/task adds up to $1,200/year. Self-hosted n8n on a $20/month Hetzner VPS = unlimited executions, full retry control, no vendor lock-in.

Write once → Query millions

Why PostgreSQL for GTM data

Sales data is relational by nature: contacts → events → opportunities. Triggers compute meeting_sequence at write time, not query time. CTEs keep complex pipeline reports readable.

10x cost savings

Why GoHighLevel over HubSpot

For 140 LinkedIn accounts, GHL's location-based pricing beats HubSpot's per-seat model by 10x. API is rougher, but that's what n8n is for.

Production Lessons

Bugs I've shipped and fixed

Every production system has war stories. These taught me the most.

The comma that broke everything

User input: "Bruno Migaruka Kuzwayezu, ACIArb"

A contact's name with a comma crashed our entire Calendly pipeline. The comma was splitting SQL parameters mid-query.

Fix

Moved from string interpolation to proper parameterized queries.

Lesson

Never trust user input delimiters.
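A minimal sketch of the before/after, assuming an illustrative invitee_name column (the production pipeline binds its parameters from n8n, but the idea is the same):

PostgreSQL
-- Broken version: the INSERT was assembled by string concatenation, so the comma in
-- "Bruno Migaruka Kuzwayezu, ACIArb" shifted every value one column to the right.
-- Parameterized version: the whole name travels as a single bound value ($2).
PREPARE add_booking (uuid, text, timestamptz) AS
  INSERT INTO booking_events (customer_id, invitee_name, meeting_time)
  VALUES ($1, $2, $3);

EXECUTE add_booking (
  'c0ffee00-0000-0000-0000-000000000001',  -- illustrative customer_id
  'Bruno Migaruka Kuzwayezu, ACIArb',      -- commas are just data here
  NOW()
);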

The duplicate webhook storm

Calendly sends: create + update + reschedule

Without deduplication, one meeting generated three database entries. The sales team was seeing triple the actual bookings.

Fix

Built webhook_receipts table with event_id unique constraint.

Lesson

Always assume external APIs will spam you.
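The receipt table is small enough to show in full. Only webhook_receipts and the event_id unique constraint come from the real system; the other columns are illustrative:

PostgreSQL
-- Idempotent intake: every incoming webhook tries to claim its event_id first.
CREATE TABLE webhook_receipts (
  event_id     TEXT PRIMARY KEY,          -- Calendly event identifier (unique)
  received_at  TIMESTAMPTZ DEFAULT NOW(),
  payload      JSONB
);

-- Duplicate deliveries hit the constraint and insert nothing.
INSERT INTO webhook_receipts (event_id, payload)
VALUES ('evt_123', '{"action": "invitee.created"}')
ON CONFLICT (event_id) DO NOTHING;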

The demo-day timeout

50+ Metabase cards, zero caching

The dashboard died during an investor presentation. Each card ran a separate unoptimized query against a cold database.

Fix

Materialized views refreshed hourly + query rewrites.

Lesson

Demo day will find your performance bugs.
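Roughly what the fix looks like; the view name and columns here are illustrative, not the production definitions:

PostgreSQL
-- Pre-compute the expensive aggregation once...
CREATE MATERIALIZED VIEW mv_pipeline_summary AS
SELECT
  d.stage_id,
  COUNT(*)                      AS deals,
  COUNT(DISTINCT d.customer_id) AS customers
FROM deals d
GROUP BY d.stage_id;

-- ...then refresh it hourly (pg_cron or an n8n schedule) so Metabase cards
-- read a warm, tiny table instead of re-aggregating raw events.
REFRESH MATERIALIZED VIEW mv_pipeline_summary;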

Data Modeling

How I structured ThinkFISH's warehouse

Relational design with computed fields at write time. One write saves millions of reads.

Entity Relationships

PostgreSQL
-- Core entities (simplified)
customers (
  customer_id   UUID PRIMARY KEY,
  email         TEXT UNIQUE,
  source        TEXT,
  member_id     UUID REFERENCES team_members
)
    │
    │ 1:many
    ▼
booking_events (
  event_uuid       UUID PRIMARY KEY,
  customer_id      UUID REFERENCES customers,
  member_id        UUID REFERENCES team_members,
  meeting_time     TIMESTAMPTZ,
  touch_type       TEXT,
  status           TEXT,
  meeting_sequence INT  -- computed via trigger
)
    │
    │ 1:many
    ▼
deals (
  deal_id      UUID PRIMARY KEY,
  customer_id  UUID REFERENCES customers,
  pipeline_id  UUID,
  stage_id     UUID,
  status       TEXT,
  created_at   TIMESTAMPTZ DEFAULT NOW()
)
    │
    │ 1:many
    ▼
deal_stage_history (
  history_id   UUID PRIMARY KEY,
  deal_id      UUID REFERENCES deals,
  prev_stage   UUID,
  new_stage    UUID,
  changed_at   TIMESTAMPTZ DEFAULT NOW()
)

Write-Time Computation

PL/pgSQL
-- Compute meeting_sequence at INSERT, not query time
CREATE OR REPLACE FUNCTION set_meeting_sequence()
RETURNS TRIGGER AS $$
BEGIN
  NEW.meeting_sequence := (
    SELECT COUNT(*) + 1
    FROM booking_events
    WHERE customer_id = NEW.customer_id
      AND meeting_time < NEW.meeting_time
      AND status = 'active'
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_meeting_sequence
  BEFORE INSERT ON booking_events
  FOR EACH ROW
  EXECUTE FUNCTION set_meeting_sequence();

Key insight: meeting_sequence is computed via trigger at INSERT, not calculated at query time.

One write operation vs. millions of aggregation reads. The math is obvious.
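To make that concrete, here is the kind of read the stored field enables; the touch-depth question is just an example:

PostgreSQL
-- Flat aggregate over a pre-computed column: no window functions, no re-ranking.
SELECT
  touch_type,
  ROUND(AVG(meeting_sequence), 1) AS avg_touches
FROM booking_events
WHERE status = 'active'
GROUP BY touch_type;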

Code Evolution

From spreadsheet hell to SQL clarity

The same question, answered two different ways. One of them scales.

Before

Google Sheets
"Which leads came from Tyler's
Calendly link last week?"

→ Open Google Sheets
→ Find Tyler's tab
→ Scroll to last week's dates
→ Count manually (miss some)
→ Cross-reference with other sheets
→ VLOOKUP fails on duplicates
→ Give up, estimate

Time: 15 minutes
Accuracy: ~70%
Confidence: Low

After

PostgreSQL
SELECT
  COUNT(*) as total_leads,
  AVG(meeting_sequence) as avg_touch
FROM booking_events be
JOIN team_members tm
  ON be.member_id = tm.member_id
WHERE tm.name = 'Tyler'
  AND be.meeting_time > NOW() - INTERVAL '7 days';

-- Result: 23 leads, 1.4 avg touches
-- Time: 0.3 seconds
-- Accuracy: 100%

15min

→ 0.3 seconds

Time to answer

~70%

→ 100%

Accuracy

1

→ Unlimited

Concurrent users

Query I'm Proud Of

Pipeline velocity in one query

This tells leadership exactly where deals stall. No BI tool clicks, no waiting — just SQL.

Query

35 lines
-- Pipeline velocity report: where do deals stall?
WITH stage_times AS (
  SELECT
    d.deal_id,
    h.prev_stage AS stage_id,  -- stage the deal just exited (per deal_stage_history)
    LAG(h.changed_at) OVER (
      PARTITION BY d.deal_id
      ORDER BY h.changed_at
    ) AS entered_stage,
    h.changed_at AS left_stage
  FROM deals d
  JOIN deal_stage_history h
    ON d.deal_id = h.deal_id
  WHERE d.created_at > NOW() - INTERVAL '90 days'
)
SELECT
  s.stage_name,
  ROUND(
    AVG(
      EXTRACT(EPOCH FROM (left_stage - entered_stage)) / 86400
    )::numeric, 1
  ) AS avg_days_in_stage,
  COUNT(DISTINCT st.deal_id) AS deals,
  ROUND(
    COUNT(DISTINCT st.deal_id) * 100.0 /
    SUM(COUNT(DISTINCT st.deal_id)) OVER ()
  , 1) AS pct_of_pipeline
FROM stage_times st
JOIN pipeline_stages s ON st.stage_id = s.stage_id
WHERE st.entered_stage IS NOT NULL
GROUP BY s.stage_name, s.stage_order
ORDER BY s.stage_order;

Sample Output

stage_name     | avg_days | deals | pct
────────────────────────────────────────
Discovery      |     1.2  |   156 | 34.2%
Qualification  |     2.8  |   142 | 31.1%
Proposal       |     3.4  |    89 | 19.5%
Negotiation    |     4.2  |    52 | 11.4%
Closed Won     |     0.5  |    17 |  3.7%

Techniques Used

CTEs: Readable subquery organization
Window Functions: LAG() for previous-row data
Date Math: EXTRACT(EPOCH) for duration
Aggregation: AVG, COUNT with GROUP BY
Percentage Calc: Window SUM for total context

Key Insight

Deals spend 4.2 days in Negotiation — that's where we lose them. Now sales knows where to focus.

Beliefs

Things I've learned the hard way

Spreadsheets are where data pipelines go to die

If your sales team lives in Google Sheets, you don't have a CRM — you have a shared anxiety document.

The best automation is invisible to users

If users notice your automation, it's probably broken. Success = "wait, that happens automatically?"

Every external webhook lies

They'll send duplicates, out-of-order events, and malformed payloads. Build like you expect them to.

AI is a tool, not a strategy

I use Claude daily for code review, data analysis, drafting. But system design? That's still human judgment.

Stack

Tools I work with

DATA

PostgreSQL · Metabase · SQL

AI / LLM

OpenAI · Claude · LangChain · Python

AUTOMATION

n8n · Zapier · Make.com · Webhooks

GTM

HubSpot · GoHighLevel · Calendly · Slack

ENRICHMENT

Clay · Apollo · Webhound.ai

INFRASTRUCTURE

Coolify · Hetzner · REST APIs · GitHub

Proof of Work

See it live

Patterns I Use

Reusable solutions I've refined across projects:

01

Webhook Deduplication

Receipt table pattern for idempotent processing

02

Retry with Backoff

Exponential backoff for failed webhook delivery (sketch after this list)

03

Event Sourcing

Stage history tracking for audit trails

04

Materialized Views

Pre-computed aggregations for dashboards
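A minimal sketch of pattern 02; the table, names, and thresholds are illustrative, and the actual retry logic runs in n8n:

PostgreSQL
-- Postgres-backed retry queue for failed webhook deliveries.
CREATE TABLE webhook_retries (
  delivery_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  target_url    TEXT NOT NULL,
  payload       JSONB NOT NULL,
  attempts      INT NOT NULL DEFAULT 0,
  next_retry_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- On failure, push the next attempt out exponentially: 1, 2, 4, 8... minutes.
UPDATE webhook_retries
SET attempts      = attempts + 1,
    next_retry_at = NOW() + INTERVAL '1 minute' * POWER(2, attempts)
WHERE delivery_id = '00000000-0000-0000-0000-000000000001';

-- A scheduled workflow only picks up rows that are due and not exhausted.
SELECT * FROM webhook_retries
WHERE next_retry_at <= NOW() AND attempts < 6;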

Coming soon: n8n-webhook-patterns repo with deduplication, retry logic, and error handling examples.

Experience

Career timeline

2024Director of Technology, ThinkFISH
2023Systems Manager, ThinkFISH
2022Data Scientist, Manhattan Global Partners
2021Lead Generation Manager, RealtySimplified
2019Data Analyst, Etaserve
2016L2 Support Supervisor, HP Inc.

2016B.Tech Information Technology — NIT Karnataka

Get in Touch

Let's build your data infrastructure

Download Resume