Freelance Agent Evaluation Engineer

Mindrift
Department: HR
Type: Remote
Region: Australia
Location: Australia
Experience: Mid-Senior level
Estimated Salary: $52,000 - $104,000
Skills: Python, FastAPI, JavaScript, TypeScript, React, Docker, Postgres, Kafka, Redis, Testing

Job Description

Posted on: June 12, 2026

Please submit your CV in English and indicate your level of English proficiency.

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.

What This Opportunity Involves

We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks. You'll create challenging tasks and evaluation criteria within realistic simulated environments:

  • Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
  • Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
  • Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
  • Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust
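
To illustrate what "neither too strict nor too lenient" can mean in practice, here is a minimal sketch in Python. It assumes a hypothetical task ("return the k most frequent words, in any order") that is not taken from the actual project; the function name `is_valid_top_k` is likewise invented for illustration. The check validates properties of the answer instead of comparing against one hard-coded list, so any correct tie-breaking is accepted while wrong answers are still rejected.

```python
from collections import Counter
from typing import List


def is_valid_top_k(words: List[str], k: int, result: List[str]) -> bool:
    """Accept any correct top-k answer, however ties are broken or ordered.

    Hypothetical checker for an illustrative task; not part of any real project.
    """
    counts = Counter(words)
    expected_len = min(k, len(counts))

    # Reject answers of the wrong size or with repeated words.
    if len(result) != expected_len or len(set(result)) != expected_len:
        return False
    if not result:
        return True

    # Reject words that never appeared in the input.
    if any(word not in counts for word in result):
        return False

    # Every kept word must be at least as frequent as every omitted word.
    kept = set(result)
    lowest_kept = min(counts[word] for word in kept)
    return all(
        count <= lowest_kept
        for word, count in counts.items()
        if word not in kept
    )
```

With the input `["a", "b", "b", "c", "c", "d"]` and `k = 3`, both `["b", "c", "a"]` and `["c", "b", "d"]` pass, whereas a test that compares against a single expected list would wrongly fail one of them.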

What This Is NOT

  • Not data labeling
  • Not prompt engineering
  • Not writing code from scratch - the agent writes most of the code; you guide and evaluate

What We Look For

  • 5+ years in software development
  • Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
  • Experience writing tests (functional, integration)
  • English proficiency - B2+

Why This Is Hard

Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

How It Works

Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid

Effort Estimate

Tasks for this project are estimated to take 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.

Compensation

Up to $50/hr equivalent, depending on level and pace. Tasks are estimated at 20 hours each; you set your own schedule.

Originally posted on LinkedIn
