Crowdsourcing Became the AI Data Supply Chain

Sasha Lopashev — Mon, 08 Jun 2026 05:34:13 GMT

AI did not kill crowdsourcing.

It made crowdsourcing confidential, expert-led, and harder to audit.

That is the conclusion I did not want to reach.

For years I kept returning to the same problem: crowdsourcing platforms are marketplaces, but they are not fair marketplaces. One side sets the price, defines the task, accepts or rejects the work, controls payout timing, and usually controls the appeal process too. The other side does the work and hopes the rules will be applied honestly.

The paying side is called the requester. The non-paying side is called the worker. That language is already the whole story.

I am writing this because I think one version of this idea is probably dead. For years I wanted a fairer open market for digital labor: transparent rules, automatic payouts, portable reputation, and financial penalties for bad behavior on both sides. The AI boom did not make that dream come true. It absorbed the work into corporate data supply chains. But the mechanism underneath the dream still matters. This post is my attempt to separate the dead product idea from the live technical problem.

The Old Unfair Bargain

I first became obsessed with this as a contributor to Google Crowdsource. Later, from 2017 to 2019, I built and owned an internal crowdsourcing platform at Indeed. We had hundreds of global users working on quality improvements for aggregation pipelines. It was not a toy system. It had real operational use, real quality concerns, and real incentives hiding underneath the interface.

The work lived in the long tail where machine learning pipelines were useful but not enough. A job could appear in multiple places with slightly different wording, different titles, different company names, different locations, and different levels of freshness. Companies and job postings had to be stitched from multiple sources: large feeds, small company websites, pages from the physical world that later became digital inputs, and all the awkward edge cases that refuse to fit a clean schema.

Before that platform existed, the process was almost comically manual: spreadsheets upon spreadsheets. My hot take is that almost every product on the Internet is a glorified spreadsheet, but this was the rare case where the spreadsheet had become the product architecture. One person managed other people, who managed spreadsheets to split work between outside contractors, who returned more spreadsheets, which then had to be merged into another spreadsheet, before someone could manually push the final result into the system of record and kick off the real data pipelines.

The platform replaced that with a workflow. Workers subscribed to the countries, languages, and task types they could handle. Input pipes were automatic. Output grading was automatic and immediate. The results fed the systems that needed them instead of waiting for a human spreadsheet ceremony. Internal estimates put the savings in the low millions, and just as importantly, it freed people from managing the mechanics of work distribution so they could do more useful work.

And because software has a sense of humor, it started as a side project. It was supposed to be a temporary stop-gap until deeper systems evolved enough to replace it. That took about four years. For roughly three of them, the platform ran in maintenance mode with minimal input from me.

What stayed with me was this: quality problems in crowdsourcing are rarely just quality problems. They are incentive problems.

Classic crowdsourcing had a simple pitch: break large problems into small tasks, send those tasks to a distributed workforce, aggregate the answers, and get human judgment at scale.

That model gave us image labels, map corrections, translation checks, entity validation, content moderation, search relevance judgments, surveys, and countless small acts of data maintenance that made digital products feel cleaner than they really were.

But structurally, the bargain was always lopsided.

The requester wants low cost, high quality, fast turnaround, low management overhead, and the option to reject bad work.

The worker wants predictable pay, clear instructions, fair evaluation, fast settlement, and a reputation that can survive beyond one platform.

The platform wants throughput, margin, buyer retention, enough worker supply to keep prices down, and enough opacity to manage disputes at scale.

These goals overlap, but they do not naturally align.

The quality problem is produced by incentives before it shows up in review queues.

The worst version looks like this: the requester under-specifies the task, the worker guesses what will be accepted, the platform measures surface-level agreement, and the worker absorbs the cost of ambiguity. If the submission is accepted, the worker earns a tiny payout. If it is rejected, the worker has already spent the time.

That is not just unfair. It is bad data engineering.

When a system rewards speed more reliably than care, it should not be surprised when it gets speed. When a system punishes workers for ambiguity they did not create, it should not be surprised when workers avoid hard tasks, copy common answers, or learn whatever shortcut gets past the filter.
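
To make that arithmetic concrete, here is a minimal sketch. Every number in it is made up for illustration: the payout, the acceptance rates, and the minutes per task are assumptions, not figures from any real platform. The function name is hypothetical too.

```python
def effective_hourly_rate(payout: float, acceptance_rate: float, minutes_per_task: float) -> float:
    """Expected earnings per hour when rejected submissions pay nothing."""
    if minutes_per_task <= 0:
        raise ValueError("minutes_per_task must be positive")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    tasks_per_hour = 60.0 / minutes_per_task
    return payout * acceptance_rate * tasks_per_hour


# Hypothetical numbers, chosen only to show the shape of the incentive.
careful = effective_hourly_rate(payout=0.10, acceptance_rate=0.95, minutes_per_task=3.0)
rushed = effective_hourly_rate(payout=0.10, acceptance_rate=0.80, minutes_per_task=1.0)
print(f"careful: ${careful:.2f}/hour, rushed: ${rushed:.2f}/hour")
```

With these invented inputs, the rushed worker earns roughly two and a half times more per hour despite the lower acceptance rate. The exact numbers do not matter. The direction of the gradient does.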

The platform then calls this a quality-control problem.

I think that is backwards. It is a market-design problem that later expresses itself as a quality-control problem.

Why The Open Market Felt Hollow

There was another reason I kept doubting the open crowdsourcing thesis: a lot of the open market already looked compromised.

While I was prototyping my first attempt at a fairer marketplace, many of the open microtask platforms I found were not full of the noble “human intelligence tasks” I wanted to believe in. They were full of tiny campaigns to manufacture attention and trust: like a YouTube video, follow an account, upvote a post, write a product review, install an app, leave a comment, make something look more popular than it was.

Some of this has a formal name: crowdturfing. Researchers have studied human-powered crowdsourcing platforms being used for astroturf campaigns, social-media manipulation, malicious URL distribution, fake likes, and fake reviews. One paper looked at RapidWorkers campaigns targeting Amazon reviews and reported tasks paying workers from $0.10 to $1.50. The FTC’s later rule banning fake reviews, false testimonials, and fake social-media influence indicators is a useful signal of how normalized this whole pattern became.

That was not the entire market, obviously. There were legitimate data tasks, research tasks, moderation tasks, and labeling work. But the public surface area felt polluted. The open platforms that were easiest to inspect often looked less like a future of dignified digital labor and more like a liquidity pool for plausible deniability.

Mechanical Turk gave me the opposite signal. I tried to get in as a worker, partly just to test and explore the experience, and it was surprisingly hard. My read at the time was simple: if a platform makes it difficult to join the worker side, it probably does not have a worker shortage. It has a demand problem, a quality problem, or both.

Toloka gave me a third signal: the economics could be absurd because the platform had enough supply to make them absurd. Some search relevance tasks were genuinely hard. Comparing two search results pages is not “click the better one.” It can require reading detailed criteria, understanding user intent, comparing freshness, authority, language, location, and edge cases, and then applying that rubric consistently across a batch. Toloka itself describes modern search relevance evaluation as depending on human understanding of language, context, calibration, and guideline interpretation.

And yet the pay could look like pennies against pages of criteria.

That combination told me something important. Open crowdsourcing was not only unfair because requesters had power. It was unfair because the market had learned that it could spend human attention like a nearly free resource.

So by the time AI labs started buying more specialized human data, I had two conflicting intuitions. The idealist in me still wanted a fair open labor market. The operator in me suspected the open market had already lost too much trust.

How Crowdsourcing Became AI Data Infrastructure

The old word was crowdsourcing.

The new words are human data, RLHF, preference data, model evaluation, red-teaming, expert demonstrations, coding traces, synthetic-data validation, agent trajectories, and frontier-model benchmarks.

The labor became more valuable, but it moved into a confidential corporate wrapper.

Scale AI describes itself around training data, annotation, RLHF, model evaluation, red-teaming, and AI applications. Surge AI frames its work as human intelligence for frontier AI. Datacurve focuses on coding data, long-horizon software tasks, agent trajectories, supervised fine-tuning, and evaluation benchmarks.

This is not the same market wearing a new jacket. The center of gravity really changed.

The buyer is no longer always a random requester posting microtasks. It may be an AI lab, a government agency, or an enterprise customer. The task is no longer always a simple label. It may be a preference ranking, adversarial prompt, rubric-based evaluation, code review, legal analysis, medical reasoning, or a trace of how an expert completes a hard workflow.

The work became more valuable.

It also became less visible.

That matters because the old fairness problem did not go away. It was absorbed into a more confidential, more corporate supply chain.

AI labs need human data precisely where models are weak: nuanced judgment, edge-case behavior, culturally specific reasoning, safety failures, long-horizon tasks, and domains where correctness is expensive. This kind of data is valuable because it is not easily scraped from the public internet.

But once the data becomes proprietary, the fairness system becomes harder to inspect. Workers may not know how their output is used. Customers may not know how workers were selected, paid, evaluated, or replaced. Regulators may ask for data lineage without being able to see the raw data. Platforms may be trusted because there is no practical alternative.

This is the moment where provenance becomes interesting again.

Not as a crypto slogan. As infrastructure.

Why Fairness Is A Data-Quality Problem

It is tempting to talk about worker fairness as an ethical add-on: something nice to have after the product, the platform, and the buyer workflow are solved.

I think that is a mistake.

Worker fairness is upstream data-quality infrastructure.

If workers do not trust the rules, they adapt. If they do not trust rejection decisions, they optimize for the acceptance mechanism rather than the task. If they are paid too little for hard judgment, they avoid hard judgment. If the platform silently changes quality thresholds, workers learn superstition instead of skill.

And now there is a new problem: workers can use AI too.

That is not automatically bad. There are legitimate human-AI workflows where a model helps draft, summarize, search, or structure work. But if a customer is buying independent human judgment and instead receives a pile of correlated LLM-shaped answers, the dataset is compromised in a very specific way. It may look clean. It may even pass simple agreement tests. But it is no longer the signal the buyer thought they were purchasing.

In old crowdsourcing, the cheap failure mode was spam.

In AI data work, the cheap failure mode is plausible synthetic consensus.

That means a serious platform has to ask a better question than “did the worker match the expected answer?”

It has to ask:

Did this worker provide independent, informative signal under the task’s rules?

That is much harder. It also brings us back to mechanism design.
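
As a deliberately crude illustration of what checking for independence could mean, the sketch below flags pairs of free-text answers from different workers that overlap suspiciously. Token-set Jaccard similarity and the 0.8 threshold are my assumptions for the example, and the function names are hypothetical. A real system would need far more than this, and high overlap alone does not prove anyone used a model.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers, from 0.0 to 1.0."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def flag_correlated_pairs(
    answers: dict[str, str], threshold: float = 0.8
) -> list[tuple[str, str, float]]:
    """Return (worker, worker, score) for answer pairs at or above the threshold."""
    flagged = []
    for (w1, a1), (w2, a2) in combinations(sorted(answers.items()), 2):
        score = jaccard(a1, a2)
        if score >= threshold:
            flagged.append((w1, w2, score))
    return flagged


# Hypothetical rationales for one search-relevance comparison.
answers = {
    "w1": "the second result is fresher and matches the local intent",
    "w2": "the second result is fresher and matches the local intent better",
    "w3": "first page wins, official source, result two is outdated",
}
flagged = flag_correlated_pairs(answers)
```

Here only the w1 and w2 pair is flagged. Note that all three workers could still agree on the final label, which is exactly why a simple agreement test would not notice the problem.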

What A Practical Fairness Layer Looks Like

A fairer system does not mean every worker gets paid for every submission.

That would fail immediately. Requesters do need protection from spam, fraud, low-effort work, copied answers, collusion, and model-generated submissions masquerading as independent human judgment.

Fairness means something narrower and more technical:

The rules should be known before work begins, and neither side should be able to quietly rewrite them after seeing the result.

The core mechanism is two-sided financial commitment.

In a trustless environment, there is only one pragmatic forcing function: money. Not money as a moral philosophy. Money as rewards and penalties.

A legitimate participant accepts the rules of engagement before joining. Good behavior earns money and reputation. Bad behavior loses money and reputation. Reputation should behave like it does in the real world: slow to gain, quick to lose. Money deters drive-by abuse from newcomers. Reputation governs long-term relationships.

The requester escrows enough money to prove the project is real and to make late cancellation costly. The worker attaches a proof-of-value stake to each submission, proving that they believe the submission is worth evaluating. If the work passes the declared rule, the worker gets the stake back with the reward. If the work is honest but inconclusive, the stake should usually be returned. If the work is fraudulent, low-effort, copied, or clearly automated against the task rules, the stake can be slashed.
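
The three outcomes above can be sketched as a small settlement function. This is a minimal sketch under my own assumptions: the names `Outcome`, `Settlement`, and `settle` are hypothetical, and the post does not decide where slashed stakes go, so the code only reports the slashed amount.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"          # passed the declared rule
    INCONCLUSIVE = "inconclusive"  # honest work, but the rule could not resolve it
    VIOLATION = "violation"        # fraudulent, low-effort, copied, or automated against the rules


@dataclass(frozen=True)
class Settlement:
    to_worker: float    # stake refund plus any reward
    from_escrow: float  # amount drawn from the requester's escrow
    slashed: float      # stake forfeited by the worker


def settle(outcome: Outcome, reward: float, stake: float) -> Settlement:
    """Apply the declared rule to one evaluated submission."""
    if outcome is Outcome.ACCEPTED:
        return Settlement(to_worker=stake + reward, from_escrow=reward, slashed=0.0)
    if outcome is Outcome.INCONCLUSIVE:
        # "Usually returned" in the text; this sketch always returns it.
        return Settlement(to_worker=stake, from_escrow=0.0, slashed=0.0)
    return Settlement(to_worker=0.0, from_escrow=0.0, slashed=stake)
```

The point of writing it this way is that the function takes no requester opinion as input. The only input is the outcome of the declared rule.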

That symmetry was the point. Requesters should not be able to extract work and walk away. Workers should not be able to spray low-quality submissions for free and hope a few pass.

This aligns all three parties around value creation. The requester wants useful work results. The worker wants money and reputation. The platform wants transaction fees and repeat participation. Nobody wants to lose money. Every good actor wants the game to continue.

At minimum, a fairness layer should guarantee a few things.

First, funded work. A worker should not submit into a project whose reward pool does not exist.

Second, known rules. Acceptance logic should be declared before work begins.

Third, bounded requester discretion. A requester should not be able to reject useful work after extracting the value of the submission.

Fourth, worker proof-of-value. Each submission should carry either a small financial stake, a reputation stake, or both, so low-effort, fraudulent, or automated submissions are not free to spray across the system.

Fifth, quality aggregation. Individual answers should be evaluated through a task-appropriate method, not purely through arbitrary requester review.

Sixth, payout finality. Once the declared rule is satisfied, settlement should happen automatically.

Seventh, auditability. The system should preserve enough metadata to reconstruct how data was produced, accepted, rejected, appealed, and paid.

Eighth, confidentiality. Auditability should not require exposing raw customer data, private worker identity, proprietary prompts, or unpublished model failures.

A fairness layer does not remove quality control. It turns discretionary rejection into declared rules, two-sided stakes, audit trails, and automatic settlement.

The simplest useful primitive on the requester side is escrow.

Before the project begins, the requester funds the work. A portion of the budget is reserved for accepted submissions. Another portion can be reserved for partial cancellation, failed project conditions, or worker reimbursement when the requester changes direction after participation has begun.
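
A minimal sketch of that escrow split, assuming amounts are tracked in integer cents and that 20 percent of the budget is held back for cancellation. Both choices, and the class and method names, are illustrative assumptions rather than a specification.

```python
class InsufficientEscrow(Exception):
    """Raised when the reward pool cannot cover another submission."""


class Escrow:
    """Requester-funded budget, tracked in integer cents to avoid float drift."""

    def __init__(self, budget_cents: int, cancellation_percent: int = 20) -> None:
        if budget_cents <= 0 or not 0 <= cancellation_percent <= 100:
            raise ValueError("invalid budget or cancellation share")
        # Held back for cancellation or worker reimbursement.
        self.cancellation_reserve = budget_cents * cancellation_percent // 100
        # Available for accepted submissions.
        self.reward_pool = budget_cents - self.cancellation_reserve
        # Earmarked for submissions that are still being evaluated.
        self.reserved = 0

    def reserve(self, reward_cents: int) -> None:
        """Earmark a reward before the worker starts; refuse unfunded work."""
        if reward_cents > self.reward_pool - self.reserved:
            raise InsufficientEscrow("reward pool cannot cover this task")
        self.reserved += reward_cents

    def pay(self, reward_cents: int) -> int:
        """Settle an accepted submission out of the earmarked funds."""
        if reward_cents > self.reserved:
            raise ValueError("cannot pay more than was reserved")
        self.reserved -= reward_cents
        self.reward_pool -= reward_cents
        return reward_cents

    def release(self, reward_cents: int) -> None:
        """Return an earmark to the pool when a submission is not accepted."""
        if reward_cents > self.reserved:
            raise ValueError("cannot release more than was reserved")
        self.reserved -= reward_cents
```

The design choice that matters is in `reserve`: the reward is earmarked before the worker begins, so a worker never submits into a pool that cannot pay.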

The equivalent primitive on the worker side is proof-of-value staking.

This is the unfortunate part: to earn money as a worker, you may need money to stake your submissions. I do not like that. But in an open, low-trust system, some downside has to exist before reputation exists. The stake can be small, proportional to task value, and partly reputational once the worker has a track record. The important property is not that workers risk a lot of money. The important property is that each submission carries some downside when it is provably made in bad faith.

Over time, requesters could waive or reduce staking for workers with strong platform reputation, if they trust the platform’s reputation model. They could also waive staking for specific workers they have had good prior experience with. Money is the bootstrapping deterrent. Reputation is the long-term relationship layer.
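
One way the interplay between stake and reputation could look, as a sketch. The formulas and constants are my assumptions: a reputation score between 0 and 1, a stake of 10 percent of task value for a newcomer, a waiver at 0.9, and an asymmetric update that gains slowly and loses quickly. None of this is a tested reputation model.

```python
def update_reputation(reputation: float, accepted: bool,
                      gain: float = 0.02, loss: float = 0.20) -> float:
    """Slow to gain, quick to lose. The result stays within [0, 1]."""
    if accepted:
        # Small step toward 1.0, shrinking as reputation grows.
        return min(1.0, reputation + gain * (1.0 - reputation))
    # A rule violation removes a fixed fraction of what was built up.
    return max(0.0, reputation * (1.0 - loss))


def required_stake(task_value: float, reputation: float,
                   trusted_by_requester: bool = False,
                   base_rate: float = 0.10,
                   waiver_threshold: float = 0.9) -> float:
    """Stake shrinks with reputation and is waived for trusted workers."""
    if trusted_by_requester or reputation >= waiver_threshold:
        return 0.0
    return round(task_value * base_rate * (1.0 - reputation), 4)
```

Under these assumptions, a newcomer stakes 0.10 on a task worth 1.00, a worker at reputation 0.5 stakes 0.05, and a worker the requester already trusts stakes nothing.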

The key rule is:

If the requester has received enough information to benefit from the work, the requester should not retain unilateral authority to cancel payment.

This does not eliminate rejection. It changes who gets to decide and under what constraints.

For example, imagine a project that asks workers to verify and normalize job postings across countries and languages. The project could declare its rules before anyone starts:

- workers are eligible only for the countries, languages, and task types they have opted into or qualified for
- each posting cluster needs at least 3 independent judgments before it can be accepted automatically
- fields with objective checks, like source URL, company website, or location, are graded differently from fuzzy fields, like normalized title or duplicate detection
- calibration tasks are mixed into each batch and update worker reliability
- each submission carries a small proof-of-value stake, returned when the submission is accepted or honestly unresolved, and slashed when it violates declared task rules
- workers with high reputation, or workers explicitly trusted by the requester, may receive reduced or waived staking requirements
- disagreements above a threshold route into a second-stage review task instead of becoming arbitrary rejection
- payment settles when the declared rule completes, while any requester override creates an auditable appeal record
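
Two of the rules above, the three-judgment minimum and the disagreement route, can be sketched as a single decision function. The agreement threshold of 0.6, the status strings, and the function name are assumptions for illustration. Keying judgments by worker ID is a stand-in for independence, not a guarantee of it.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def decide_cluster(
    judgments: dict[str, str],
    min_judgments: int = 3,
    min_agreement: float = 0.6,
) -> tuple[str, Optional[str]]:
    """Map one posting cluster's judgments to a declared outcome.

    `judgments` maps worker ID to that worker's label, so each worker
    counts once. Returns (status, accepted_label_or_None).
    """
    if len(judgments) < min_judgments:
        return ("pending", None)
    label, count = Counter(judgments.values()).most_common(1)[0]
    if count / len(judgments) >= min_agreement:
        return ("accepted", label)
    # Too much disagreement: escalate instead of rejecting anyone.
    return ("second_stage_review", None)
```

For example, two judgments leave the cluster pending, two out of three matching labels accept it automatically, and three different labels send it to second-stage review. In no branch does the requester's after-the-fact opinion appear.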

That is a practical fairness layer. It does not require pretending every answer is equally good. It also does not let the requester wait until after the work is visible and then quietly redefine what counted.

The important shift is that rejection becomes rule-bound. The requester can still define what good work means. The platform can still enforce fraud rules. The worker can still lose reputation for bad submissions.

But the worker is no longer donating time into a black box.

Provenance Without Exposing The Data

My older version of this idea leaned toward blockchain. I still understand why.

If the platform has too much power, make the important events hard to rewrite. If requesters can reject work after seeing it, make the acceptance rules public before the work begins. If workers do not trust payout timing, make settlement automatic. If datasets become valuable, preserve the chain of how they were made.

The mistake is assuming the data itself should be public.

In the current AI market, that is often impossible. The tasks may contain private prompts, customer data, benchmark items, model weaknesses, expert workflows, or safety failures. Public exposure could destroy the value of the dataset or create real security and privacy risk.

So the better design is not “public data on-chain.”

It is “verifiable provenance with selective disclosure.”

The valuable record is not raw public data. It is a verifiable history of rules, versions, evaluations, payouts, and overrides.

A practical system might record the project rule commitment, asset hashes or encrypted references, task version identifiers, worker pseudonymous identifiers, submission commitments, evaluation outcomes, payout events, appeal or override events, and dataset export manifests.

The raw assets can remain in controlled storage. The ledger can be a permissioned append-only log, a transparency log, a database with cryptographic commitments, or a public chain for a narrow subset of events.
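
For the "database with cryptographic commitments" option, a hash-chained append-only log is one minimal shape. The sketch below uses only Python's standard `hashlib` and `json` modules. The class name, the event fields, and the in-memory storage are assumptions for illustration, and a production log would also need signatures, durable storage, and external anchoring.

```python
import hashlib
import json


def _digest(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class ProvenanceLog:
    """Append-only log in which every entry commits to its predecessor."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries = []

    def append(self, event: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry_hash = _digest(prev, event)
        self._entries.append({"event": dict(event), "prev": prev, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any rewritten entry breaks every later hash."""
        prev = self.GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True


# Hypothetical events: only hashes and pseudonymous IDs, never raw assets.
log = ProvenanceLog()
log.append({"type": "rule_commitment", "project": "p1", "rule_hash": "ab12"})
log.append({"type": "payout", "worker": "anon-7", "amount_cents": 150})
```

The log stores references and outcomes, not the underlying data, which is what lets the raw assets stay in controlled storage.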

The implementation is less important than the verification question:

Who needs to prove what, to whom, without revealing more than necessary?

A worker may need to prove they were paid under the declared rule. A requester may need to prove a dataset was created through a specific workflow. A regulator may need evidence of lineage, consent, evaluation, and data governance. A model developer may need to prove that a benchmark was not contaminated.
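
The simplest building block for "prove it without revealing more than necessary" is a salted hash commitment: publish the commitment at submission time, then reveal the content and salt only to the party who needs to check it. This is a plain commit-and-reveal sketch using the standard library, not a zero-knowledge proof, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit(content: bytes, salt: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Return (commitment, salt). The salt keeps short content from being guessed."""
    if salt is None:
        salt = secrets.token_bytes(16)
    return hashlib.sha256(salt + content).hexdigest(), salt


def verify_opening(commitment: str, content: bytes, salt: bytes) -> bool:
    """Check a revealed (content, salt) pair against a published commitment."""
    expected = hashlib.sha256(salt + content).hexdigest()
    return hmac.compare_digest(expected, commitment)


# The worker publishes only `commitment`; the auditor later receives
# the submission and salt privately and checks them.
submission = b'{"cluster": "c-42", "label": "duplicate"}'
commitment, salt = commit(submission)
```

An auditor who receives the submission and salt can confirm it matches what was logged at the time, while everyone else sees only an opaque hash.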

Those are provenance problems.

They just do not require the romantic version of decentralization.

What Still Feels Alive

This is the part I resisted.

I wanted the answer to be a fair, open digital labor market. Anyone could participate. Work would be transparently priced. Workers would have portable reputation. Requesters would get high-quality results. The platform would become a neutral broker rather than a private judge.

I still like that world.

I am less convinced it is the world we are getting.

Enterprise buyers prefer vendors. They want contracts, account management, security review, indemnity, compliance posture, and someone to blame. AI labs do not want their most sensitive evaluation data floating through a public marketplace. High-value tasks increasingly require verified expertise. Open participation increases adversarial pressure. Absolute anonymity conflicts with some compliance requirements. Crypto adds friction before a worker earns a cent.

So the original marketplace thesis may be dead.

But the mechanism is not.

The buyer changed.

The open marketplace wanted fairness as labor-market infrastructure.

The AI data market may need fairness as quality, compliance, and audit infrastructure.

That is a colder framing. It is also probably more true.

Three ideas survived my own skepticism.

First: fairness is quality infrastructure.

The way workers are paid, evaluated, rejected, and appealed changes the data they produce. Treating labor conditions as separate from data quality is a category error.

Second: provenance is compliance infrastructure.

The valuable record is not raw public data. It is verifiable lineage: who did what kind of work, under which rules, against which task version, with what evaluation, and with what settlement.

Third: the next hard problem is adversarial human data.

Collecting labels is no longer the impressive part. The hard part is eliciting independent, high-effort, task-faithful signal when both humans and models can cheaply imitate low-quality work.

That problem sits underneath RLHF, model evaluation, red-teaming, expert data, benchmarks, and agent training.

The old question was:

How do we get enough humans to label enough data cheaply enough?

The new question is:

How do we know the human signal we bought is real, independent, fairly paid, and governed by rules that survived contact with incentives?

I started with a romantic idea: a fairer marketplace for human intelligence tasks.

The world moved. AI made human data more valuable, but also more confidential. The open crowd was folded into corporate infrastructure. The microtask became an expert trace. The label became a preference, a critique, a demonstration, a red-team attempt, a workflow, a judgment.

So maybe the product I imagined is dead.

But the problem is not.

The deeper problem was never “how do we put crowdsourcing on a blockchain?”

The deeper problem was:

How do we design a data labor market where the party producing the signal can trust the rules, and the party buying the signal can trust the quality?

That problem is now sitting underneath frontier AI.

Crowdsourcing did not die.

It became part of the corporate AI data supply chain.

The question is whether the next generation of AI data infrastructure will make that labor auditable, fair, and worth doing, or merely more efficiently abstracted away.

Sasha Lopashev