Senior Associate/Assistant Vice President, Agent Harness/Assurance Specialist

Location: SG, 238891

Group: Corporate Group

Department: Technology

Section: Applications, Data & Digital

Job Type: Permanent

Req ID: 12087

Temasek is a global investment company headquartered in Singapore, with a net portfolio value of S$434 billion (US$324 billion, €299 billion, £250 billion, and RMB2.35 trillion) as at 31 March 2025. Marking our unlisted assets to market would provide S$35 billion of value uplift and bring our mark-to-market net portfolio value to S$469 billion.

Our Purpose “So Every Generation Prospers” guides us to make a difference for today’s and future generations.

Operating on commercial principles, we seek to deliver sustainable returns over the long term.

We have 13 offices in 9 countries around the world: Beijing, Hanoi, Mumbai, Shanghai, Shenzhen, and Singapore in Asia; and Brussels, London, Mexico City, New York, Paris, San Francisco, and Washington, DC outside Asia. 

For more information on Temasek, please visit www.temasek.com.sg.
For Temasek Review 2025, please visit www.temasekreview.com.sg.
For Sustainability Report 2025, please visit https://www.temasek.com.sg/content/dam/temasek-corporate/sustainability/2025/Temasek-Sustainability-Report-2025.pdf.

Introduction

As Temasek deploys AI agents into production investment workflows, the question of whether those agents are behaving correctly, consistently, and within their intended scope becomes both a technical engineering challenge and a governance imperative. Agent behaviour cannot be verified by reading code alone — it must be tested empirically, monitored continuously, and evaluated against defined behavioural specifications that encode what the agent should and should not do.

The Agent Harness / Assurance Specialist is a specialist engineering and evaluation role responsible for designing and operating the systems that verify, validate, and continuously monitor the behaviour of Temasek's AI agents. This role combines context engineering (the design of agent harnesses that shape and constrain agent behaviour), AI evaluation (the systematic assessment of whether agents meet their behavioural specifications), and operational assurance (the detection and management of agent drift, failures, and out-of-scope behaviour in production). It is a rare, high-value specialisation at the frontier of enterprise AI deployment practice.

Responsibilities

Agent harness and context engineering

Design and implement agent harnesses — the scaffolding of system prompts, context injection patterns, tool permission scopes, memory structures, and behavioural constraints that shape how AI agents operate within intended boundaries.
Develop and maintain behavioural specifications for deployed agents, including intended behaviour, permissible tool usage, decision-making boundaries, escalation triggers, and prohibited actions, serving as the ground truth against which agent behaviour is evaluated.
Engineer dynamic context management systems, determining what information is injected into agent context at each workflow step, how context is prioritised and filtered, and how context is managed across multi-turn interactions.
Build and maintain prompt versioning and change management infrastructure, tracking prompt changes across agent versions, evaluating behavioural impacts before production deployment, and maintaining rollback capability for agent configurations.
Research and apply emerging techniques in context engineering and behaviour steering, including constitutional AI prompting, chain-of-thought elicitation, self-critique patterns, and structured output enforcement, translating research insights into production improvements.
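
For illustration only, a behavioural specification of the kind described above could be expressed as data and checked in code. The following is a minimal sketch under stated assumptions, not a description of Temasek's systems: the class, field, function, and tool names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BehaviourSpec:
    """Hypothetical behavioural specification for one agent version."""
    agent_name: str
    prompt_version: str
    # Tools the agent may call without review (assumed names).
    allowed_tools: frozenset = field(default_factory=frozenset)
    # Tools whose use must be escalated to a human (assumed names).
    escalation_tools: frozenset = field(default_factory=frozenset)


def check_tool_call(spec, tool_name):
    """Return 'escalate', 'allow', or 'deny' for a proposed tool call."""
    if tool_name in spec.escalation_tools:
        return "escalate"
    if tool_name in spec.allowed_tools:
        return "allow"
    # Default-deny: anything not listed in the spec is treated as prohibited.
    return "deny"


spec = BehaviourSpec(
    agent_name="research-assistant",
    prompt_version="v3",
    allowed_tools=frozenset({"search_filings", "summarise_document"}),
    escalation_tools=frozenset({"send_external_email"}),
)

print(check_tool_call(spec, "search_filings"))       # allow
print(check_tool_call(spec, "send_external_email"))  # escalate
print(check_tool_call(spec, "delete_records"))       # deny
```

Keeping the specification as versioned data alongside the prompt version is one way to give evaluations a fixed ground truth to test against.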

AI behaviour evaluation and testing

Design and operate evaluation frameworks for deployed agents, including automated evaluation suites assessing task completion accuracy, instruction-following fidelity, tool-calling correctness, edge-case handling, and adversarial robustness.
Build and maintain evaluation datasets covering expected inputs, edge cases, known failure modes, and adversarial prompts, with ground-truth labels and scoring rubrics tailored to each agent's domain.
Run systematic red-team evaluations to test boundary conditions, identify out-of-scope behaviours, probe prompt injection vulnerabilities, and assess agent performance under ambiguous, conflicting, or adversarial inputs, with structured reporting and remediation recommendations.
Define and enforce evaluation gates within the AI product deployment pipeline, requiring agents to meet behavioural benchmarks before production release and automating regression testing for prompt and model version changes.
Partner with AI product teams to design user feedback mechanisms that generate ongoing evaluation signals from production usage, identifying gaps between agent behaviour and user expectations and feeding insights into dataset curation and agent improvement.
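
As a rough illustration of an evaluation gate, the sketch below computes a pass rate per evaluation category and blocks release when any category falls below its threshold. The category names, thresholds, and function names are assumptions chosen for the example, not an existing pipeline.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the pass rate per category from (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def gate(results, thresholds):
    """Return (ok, failures); a category with no results fails the gate."""
    rates = pass_rates(results)
    failures = {}
    for category, minimum in thresholds.items():
        rate = rates.get(category)
        if rate is None or rate < minimum:
            failures[category] = rate
    return (not failures, failures)


# Hypothetical evaluation results for one candidate agent version.
results = [
    ("tool_calling", True), ("tool_calling", True), ("tool_calling", False),
    ("prompt_injection", True), ("prompt_injection", True),
]
# Hypothetical minimum pass rates required before release.
thresholds = {"tool_calling": 0.9, "prompt_injection": 1.0}

ok, failures = gate(results, thresholds)
print(ok, failures)
```

In this example the tool-calling pass rate is 2 out of 3, below the assumed 0.9 threshold, so the gate fails and the candidate would not be released. The same check can be rerun on every prompt or model version change as a regression test.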

Requirements

Experience and background

4–7 years of experience in AI/ML engineering, AI safety research, AI quality assurance, or a related technical discipline, with hands-on experience evaluating, testing, or governing LLM or agentic AI systems.
Demonstrated expertise in at least two of the following: LLM evaluation frameworks; prompt engineering and context architecture for production agents; AI red-teaming and adversarial testing; or AI system monitoring and observability.
Experience with production AI systems in regulated or high-stakes environments is strongly preferred, particularly where governance and risk management requirements are significant.
Familiarity with AI safety and alignment research is advantageous, including RLHF and its limitations, constitutional AI, interpretability research, and agent safety evaluation methodologies.

Technical capabilities

LLM and agent tooling: Deep familiarity with Anthropic Claude and OpenAI APIs (including tool use, structured outputs, and system prompt design); experience with orchestration frameworks such as LangChain, LangGraph, and AutoGen.
Evaluation frameworks: Hands-on experience with LLM evaluation tools (e.g. LangSmith Evals, RAGAS, DeepEval, or equivalent); ability to design evaluation datasets and scoring rubrics and to interpret results with statistical rigour.
Monitoring and observability: Experience with AI observability platforms (e.g. Langfuse, Helicone, Weights & Biases); ability to implement structured logging and behavioural anomaly detection.
Programming: Strong Python proficiency for evaluation scripting, data analysis, and automation; familiarity with pandas, NumPy, and related analytical libraries.

Mindset and approach

Adversarial thinker: Proactively probes systems for weaknesses, thinks like an attacker, and validates that controls work in practice, not just on paper.
Rigorous and systematic: Applies a scientific approach to evaluation, with clear hypotheses, controlled testing, reproducible results, and objective reporting of findings.
Collaborative assurance partner: Partners effectively with AI engineers and product managers to improve agent quality and safety while enabling delivery.
Continuous learner: Keeps pace with emerging developments in AI evaluation and agent safety, translating research advances into practical production improvements.