Coincidentally, I came across this interesting question twice recently.
In the first instance, a content marketer in a community I am a member of asked whether, after performing keyword research and building content pillars, an AI system could be trusted to effectively “commission” pieces from the extensive list of possible articles they had built.
Can you use AI and large language models to plan content, pick stories or drive an editorial agenda?
Our verdict at the time of publication:
No.
The second instance was in my own work, while building an MVP of the automated daily news summary tool that eventually matured into a full-blown AI podcast. The first iteration of this tool used a whitelist of possible news sources, from which the AI workflow was tasked with selecting the best stories every morning.
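In rough terms, the heart of that first iteration was a single selection step: gather the day's headlines from the whitelisted sources and ask the model to pick the most important ones. The sketch below is a simplified stand-in rather than the production code – the source list, prompt wording and helper names are illustrative only.

```python
# Simplified, illustrative sketch of the story-selection step.
# The whitelist entries, prompt wording and helper names are stand-ins,
# not the real pipeline.
from openai import OpenAI

client = OpenAI()

# Whitelist of approved news sources the workflow is allowed to draw from.
SOURCE_WHITELIST = [
    "https://techcrunch.com",
    "https://www.theverge.com",
]

def pick_top_stories(headlines: list[str], count: int = 5) -> str:
    """Ask the model to choose the most important stories from today's list."""
    prompt = (
        "Here are today's headlines from approved sources:\n"
        + "\n".join(f"- {h}" for h in headlines)
        + f"\n\nChoose the {count} most important stories and briefly explain why."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage (fetch_headlines is a hypothetical helper for pulling today's headlines):
# headlines = fetch_headlines(SOURCE_WHITELIST)
# briefing = pick_top_stories(headlines)
```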
As it turned out, the content marketer and I were having exactly the same problem – no matter how many guardrails we put in place for the AI (I was using GPT-4, they were using Gemini), it made either poor or completely incorrect decisions.
This is a great microcosm of where large language models are right now from a productivity standpoint. They excel in two categories:
- narrowly-scoped single-step tasks with a defined outcome (e.g. “look at this list of cities and append the correct country to each city, separating each with a comma.”)
- broad creative tasks with non-mission-critical outputs (e.g. “read this article and suggest a list of 10 catchy titles.”)
Editorial decisions sit right in the middle of these two categories: narrowly scoped (choose five stories from this list of 20), but with unmeasurable and highly creative outcomes. How do you teach an AI which stories are important? I tried a few different prompting methods:
Looking at the available stories, take into account the prominence of the companies involved, the stature of the people quoted, and the monetary figures involved, and decide which story is the most important
Perform a Google News search for each headline, and record the number of related stories for each. Use this number as a measure of the story’s importance
Use a combination of on-page signals to decide the story’s importance. These include:
- The word count of the story
- If available, the number of people who have read the story and the social share count
- The number of comments
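For what it's worth, those on-page signals are the one part of the problem that doesn't need a language model at all – they can be collapsed into a plain weighted score. The sketch below is purely illustrative (the weights and field names are my own assumptions, and it isn't something the workflow actually used).

```python
# Hypothetical illustration only: a deterministic version of the same
# "on-page signals" heuristic. The weights and field names are assumptions;
# the post only describes the natural-language prompt above.
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    word_count: int
    reader_count: int | None  # not every source exposes this
    share_count: int | None
    comment_count: int

def importance_score(story: Story) -> float:
    """Combine the on-page signals into a single comparable number."""
    score = story.word_count / 100
    if story.reader_count is not None:
        score += story.reader_count / 1000
    if story.share_count is not None:
        score += story.share_count / 10
    score += story.comment_count / 5
    return score

# Rank today's stories, most "important" first:
# ranked = sorted(stories, key=importance_score, reverse=True)
```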
However, none of these yielded good results – minor stories would frequently fall into the headlines section and big stories would be relegated below the fold.
A surprising issue was GPT-4’s inability to detect duplicate stories. If two publications ran the same story, the workflow would pick it up twice and cover it twice in the briefing. I had the best results with this prompt:
Before you add a story to the briefing, check the stories already included to make sure it is not a duplicate. Specifically look at:
- Company names
- Any quotes and the people they are attributed to
- Any figures or financial details included
Act cautiously, and if there is a possibility the story is a duplicate, discard it and move on to the next story.
Even with this prompt included, the workflow covered a story about Perplexity’s funding twice in a single briefing – even including the same figures ($1 billion valuation, $62.7 million in funding), names of investors, and founders. It’s possible the model was interpreting the prompt as a list of requirements (i.e. a story is only a duplicate if it includes the same companies, quotes and figures), but I think ordinary LLM errors are the more likely explanation.
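One obvious mitigation – and to be clear, this is a hypothetical sketch rather than something the briefing workflow included – is to run a coarse duplicate check in plain code before a story ever reaches the model, comparing the same fields the prompt asks GPT-4 to look at (company names and monetary figures). The helper names and regex here are my own assumptions.

```python
# Hypothetical duplicate pre-filter, not part of the workflow described above.
# It compares the same fields the dedup prompt asks the model to check.
import re

def extract_figures(text: str) -> set[str]:
    """Pull out monetary figures like '$1 billion' or '$62.7 million'."""
    return set(re.findall(r"\$[\d.,]+\s*(?:billion|million|thousand)?", text.lower()))

def extract_companies(text: str, known_companies: list[str]) -> set[str]:
    """Match against a known list of company names (assumed to be available)."""
    return {c for c in known_companies if c.lower() in text.lower()}

def looks_like_duplicate(story: str, existing: list[str], known_companies: list[str]) -> bool:
    """Flag a story if it shares companies *and* figures with one already included."""
    companies = extract_companies(story, known_companies)
    figures = extract_figures(story)
    for other in existing:
        same_companies = companies & extract_companies(other, known_companies)
        same_figures = figures & extract_figures(other)
        if same_companies and same_figures:
            return True
    return False
```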
The content marketer I mentioned earlier was having similar issues – they had two lists of platforms and features which they were trying to combine, keeping only the combinations that made sense. Imagine you have two datasets like this:
| List 1 | List 2 |
| --- | --- |
| Ham | Pizza |
| Apple | Sandwich |
| Chocolate | Cookie |
| Cinnamon | Butter |
| Peanut | Roast |
You have 25 possible combinations here, some of which are perfect (chocolate cookie, peanut butter, ham roast), some of which are odd but could make sense (apple cookie, peanut roast), and some of which are clearly bonkers (ham butter!).
Now imagine each list is 500 items long. To me, this is a great use case for a large language model. After discussion, the community came up with a prompt that attempted to arrange the data into three categories:
- Valid
- Borderline (for human review)
- Invalid
It included guardrails instructing the AI to check whether the phrase existed to a reasonable degree in its training corpus, and whether it was present in popular literature. If unsure, it could perform a web search to see how many results came back, or whether Google suggested something else.
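Wired up in code, that three-bucket approach might look something like the sketch below. The labels match the list above, but the client, model choice, prompt wording and helper names are illustrative assumptions on my part (the marketer was using Gemini, not GPT-4), not the community's actual prompt.

```python
# Hypothetical sketch of the three-bucket classification loop. The labels come
# from the list above; the model, prompt wording and helper names are assumptions.
from itertools import product
from openai import OpenAI

client = OpenAI()

LABELS = {"valid", "borderline", "invalid"}

def classify_combination(a: str, b: str) -> str:
    """Ask the model whether a combination makes sense, is borderline, or is invalid."""
    prompt = (
        f'Does the phrase "{a} {b}" describe a real, commonly used combination?\n'
        "Answer with exactly one word: valid, borderline, or invalid.\n"
        "Use 'borderline' whenever you are unsure, so a human can review it."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce (but, as it turned out, not eliminate) random outcomes
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "borderline"

def classify_all(list_a: list[str], list_b: list[str]) -> dict[tuple[str, str], str]:
    """Classify every pairwise combination of the two lists."""
    return {(a, b): classify_combination(a, b) for a, b in product(list_a, list_b)}
```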
However, much like my editorial workflow, this one fell down again and again, producing errors that would not have occurred had it followed its prompt. This was true even with the model’s temperature turned all the way down to reduce random outcomes.
So, for now, it appears editorial tasks that involve judgement, context and common sense will remain best completed by humans.