Talent and Capital Graphs Log
Project goal: Build a tool to surface emerging founders, researchers, and technologists early, by tracking patterns of collaboration and influence as they grow across (first) arXiv, (then) GitHub, and Twitter. (I built an earlier version of this in 2015.)
Talent Graphs Log
07/01/25
Ok so here is my problem:
Twitter API rate limits mean I can’t pull follower lists for the people I care about, and therefore can’t build a graph from them.
Similarly with arXiv: there are so many past papers and authors that I can’t build a full graph even for, say, the last 5 years.
Maybe I can add things day by day: pull the papers listed on arXiv each day for a specific category and build the graph slowly. But that won’t give me an accurate representation of centrality, because it’s only a subsection of the overall graph (of all authors on arXiv).
Similarly, centrality measures that only look at a subsection of people aren’t super helpful.
What I really want is the CHANGE in people’s centrality over time.
How do I meaningfully track change in centrality over time, even if I can only see part of the graph?
Reframe the goal: I want to identify individuals whose graph footprint is growing quickly, especially early. Who is becoming more connected, not just who is currently central.
Recipe for how to do this using arXiv authorship
Step 1: Track new relationships daily
Build a time-labeled edge graph that says: on this day, these authors became connected (or deepened their existing connection).
Every day, pull the latest cs.LG, cs.CL, cs.AI papers from arXiv
For each new paper, record: authors, paper_id, date (QUESTION: do I need more than this, like their company or university affiliation? Presumably if they’re rising quickly that’s the higher order bit, and indeed I want to find under-the-radar people first. Going to just implement this first and revisit).
Create (:Author)-[:COAUTHORED {date}]->(:Author) edges per day (don’t collapse duplicates)
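Rough sketch of what Step 1 could look like in Python before anything goes into a graph database. Assumptions (not decided above): authors are identified by plain name strings with no disambiguation, edges are stored as undirected pairs in sorted order, and CoauthorEdge, edges_from_paper, build_edge_log and the sample records are all hypothetical names and data for illustration. The paper records are assumed to have already been pulled from arXiv.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class CoauthorEdge:
    """One co-authorship event: two authors appeared on the same paper on a given day."""
    author_a: str
    author_b: str
    paper_id: str
    day: date


def edges_from_paper(authors, paper_id, day):
    """Return one edge per unordered author pair on a single paper."""
    # Sort and de-duplicate names so each pair has a stable (a, b) ordering.
    unique_authors = sorted(set(authors))
    return [
        CoauthorEdge(a, b, paper_id, day)
        for a, b in combinations(unique_authors, 2)
    ]


def build_edge_log(papers):
    """Flatten (authors, paper_id, day) records into an append-only edge log.

    Duplicate pairs are kept on purpose: a second paper by the same two
    authors produces a second edge, which is the "deepened connection" signal.
    """
    log = []
    for authors, paper_id, day in papers:
        log.extend(edges_from_paper(authors, paper_id, day))
    return log


# Hypothetical sample records; real ones would come from the daily arXiv pull.
sample_papers = [
    (["Ada", "Bo", "Cy"], "paper-1", date(2025, 7, 1)),
    (["Ada", "Bo"], "paper-2", date(2025, 7, 2)),
]
edge_log = build_edge_log(sample_papers)
```

Note that a paper with n authors yields n*(n-1)/2 edges, so very large author lists will dominate the log unless they are capped or down-weighted later.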
Step 2: Compute "local growth centrality" over time
Define a few simple, proxy signals to track reliably over time:
New coauthors in last month: Are they collaborating with new people?
Repeat Collaborators: Are they a hub for their group?
Number of unique coauthors per month (QUESTION: is this the right timeframe?): Growth of their network
Edges per time window: Engagement/velocity proxy
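A minimal sketch of how these four signals might be computed from the edge log, assuming edges are simple (author_a, author_b, day) tuples. The definitions here are my working assumptions, not settled: "new" means a coauthor never seen before the window started, and "repeat" means a coauthor with more than one edge inside the window. growth_signals and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def growth_signals(edges, as_of, window_days=30):
    """Compute per-author proxy signals from (author_a, author_b, day) edges."""
    window_start = as_of - timedelta(days=window_days)
    before = defaultdict(set)          # coauthors seen on or before window_start
    in_window = defaultdict(Counter)   # coauthor -> edge count inside the window

    for a, b, day in edges:
        if day > as_of:
            continue  # ignore edges dated after the snapshot day
        # Each edge is undirected, so credit both endpoints.
        for me, other in ((a, b), (b, a)):
            if day <= window_start:
                before[me].add(other)
            else:
                in_window[me][other] += 1

    signals = {}
    for author, counts in in_window.items():
        signals[author] = {
            "unique_coauthors": len(counts),
            "new_coauthors": len(set(counts) - before[author]),
            "repeat_collaborators": sum(1 for n in counts.values() if n > 1),
            "edges": sum(counts.values()),
        }
    return signals


# Hypothetical sample edge log.
sample_edges = [
    ("Ada", "Bo", date(2025, 5, 1)),   # before the 30-day window
    ("Ada", "Bo", date(2025, 6, 20)),
    ("Ada", "Cy", date(2025, 6, 25)),
    ("Ada", "Cy", date(2025, 6, 28)),
]
signals = growth_signals(sample_edges, as_of=date(2025, 7, 1))
```

Authors with no edges inside the window are simply absent from the result, which is fine for ranking but means "went quiet" has to be detected separately.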
Step 3: Add lightweight labels to nodes
As the graph builds over time, add labels:
SET a:Growing or: SET a.growth_score = 0.6
So I can do "centrality-style" filtering later, using growth metrics
Step 4: Track delta in metrics over time (visualize slope of attention, not position)
Store rolling stats: Every day, store a.num_coauthors_30d
Compare with last week, score the delta
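Sketch of the week-over-week comparison, assuming the same (author_a, author_b, day) edge tuples as input. For simplicity this recomputes both windows from the edge log rather than reading stored daily values; coauthors_in_window and weekly_delta are hypothetical helper names, and using the raw difference as the "score" is just a placeholder.

```python
from datetime import date, timedelta


def coauthors_in_window(edges, author, as_of, window_days=30):
    """Count distinct coauthors of `author` in the window ending at `as_of`."""
    start = as_of - timedelta(days=window_days)
    partners = set()
    for a, b, day in edges:
        if start < day <= as_of:
            if a == author:
                partners.add(b)
            elif b == author:
                partners.add(a)
    return len(partners)


def weekly_delta(edges, author, today, window_days=30):
    """Return (current, week_ago, delta) for the rolling coauthor count."""
    current = coauthors_in_window(edges, author, today, window_days)
    week_ago = coauthors_in_window(
        edges, author, today - timedelta(days=7), window_days
    )
    return current, week_ago, current - week_ago


# Hypothetical sample edge log.
sample_edges = [
    ("Ada", "Bo", date(2025, 6, 10)),
    ("Ada", "Cy", date(2025, 6, 28)),
    ("Ada", "Di", date(2025, 6, 30)),
]
current, week_ago, delta = weekly_delta(sample_edges, "Ada", date(2025, 7, 1))
```

A positive delta means the 30-day coauthor count grew over the past week. Since both numbers come from partial data, the slope is probably more trustworthy than either absolute value.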
Later: Augment the graph with other data:
Add GitHub authorship data for the same names
Add mentions from podcasts, newsletters, or Substack authors (don’t want info to be too lagging though, the goal here is to get leading signals, or at least as early as possible).
Run LLMs to extract latent topics and label clusters
Over time, this creates a hybrid graph of code, writing, and collaboration