When the workflow calls the Thunk web-scraping tool to fetch and extract data from multiple URLs, several pages sometimes return identical JSON output, even though the pages themselves differ. In other words, distinct websites end up producing the same extracted record.
I don’t know anything about this, and I hope this isn’t annoying – here are ChatGPT’s ideas of things to check:
Whether the tool is caching or reusing responses across URLs
Whether the extractor resets completely between pages so data from one run can’t persist into another
Whether a fetch error or timeout causes the tool to reuse the last successful JSON instead of returning an error
Whether the tool’s output can include the final URL and a content hash to confirm each record’s source
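ChatGPT also sketched what that last check might look like, so I'm pasting it here in case it helps. It says all the names (`tag_record`, `source_url`, `content_sha256`) are hypothetical, and it assumes the tool can expose the final URL after redirects and the raw response body:

```python
import hashlib

def tag_record(record: dict, final_url: str, page_bytes: bytes) -> dict:
    """Attach provenance fields to an extracted record.

    `record` is whatever JSON the extractor produced, `final_url` is the
    URL after redirects, and `page_bytes` is the raw response body.
    All field names here are hypothetical, not Thunk's actual schema.
    """
    return {
        **record,
        "source_url": final_url,
        "content_sha256": hashlib.sha256(page_bytes).hexdigest(),
    }

# Two pages that yield identical extracted fields will still differ in
# content_sha256 if their underlying bodies differ, which would expose
# a caching or data-reuse bug in the scraper.
a = tag_record({"title": "x"}, "https://example.com/a", b"page A body")
b = tag_record({"title": "x"}, "https://example.com/b", b"page B body")
```

If two records came back with the same `content_sha256` for different URLs, that would point at response caching; if the hashes differ but the extracted fields are identical, the problem would be downstream in the extractor.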
ChatGPT thinks it’s a data-reuse or caching issue within the scraping component. Maybe some of the above suggestions are ridiculous – I don’t know anything about this stuff… I just know that it makes our work very challenging if we can’t trust the JSON.
(I wish I knew more about this stuff so that I didn’t have to say “ChatGPT said this…!”)
Hi Tony - I’m so sorry - I can’t find an instance of it! I guess I was basically just stating a general concern: different tools and the AI at different points in the workflow seem to be able to bleed into what is being written, so it’s very hard to trust what the Thunk is saying. We know AI can hallucinate and lie, but we would really like some hard-coded guardrails so that we can at least understand what we are looking at and whether it might have been fabricated by the AI. It seems that parts of a workflow step leak into each other, which scares us a lot.
Hi @Rachel, thanks for looking, and I understand your concern. We’re actively working on making it so more of the agent’s work can be done deterministically to reduce these kinds of errors. If you encounter anything, please send it my way.