Every taxonomist knows the pressure of delivering the most accurate models under tight timelines and tighter budgets. Now that AI has permeated enterprise tools and strategic plans, the question isn't whether to use LLMs, but how to make them work most effectively in the taxonomy development cycle.
In the past year, these authors have worked on multiple projects using LLMs to augment content auditing, term generation, and relationship mapping in the taxonomy development cycle. We have found that engagement with these tools, led by taxonomists and domain experts, has enhanced the accuracy of the overall results and improved efficiency in time, cost, or staffing resources by up to 90%.
However, these processes still require expertise to mitigate certain “blunders” in the results, including:
- Inconsistent results
- Unintended results
- Incorrect results
Let’s look at how to address these problems.
Inconsistent Results
One of the most compelling use cases for LLMs in taxonomy development is the idea that they can handle data at scale, allowing more growth while expending fewer resources. However, even with prompts designed for repeatability to manage this growth, users should be prepared to encounter inconsistent results.
For example, in a project generating topic vocabularies for different enterprise products, a single tool's output varied significantly in format and quality throughout the generation process. The same prompt that produced appropriately formatted and semantically meaningful results for the first three products suddenly produced blank spreadsheet outputs for the fourth.
To mitigate this, users should think about how and why they request specific outputs. Simplicity is your ally when designing prompt inputs for an LLM: in our experience, a multi-step prompting flow produces faster, more accurate results than spending a lot of time chasing a mythical one-shot prompt. It also makes a prompt easier to adapt when needed.
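To make this concrete, here is a minimal sketch of what a scripted multi-step flow might look like. The `call_llm` helper, the prompts, and the three-step breakdown are placeholders for whatever LLM interface and wording fit your environment.

```python
# Sketch of a multi-step prompting flow for topic vocabulary generation.
# call_llm() is a placeholder for whatever LLM access point your team uses
# (enterprise chat UI, vendor API, etc.); the prompts are illustrative only.

def call_llm(prompt: str) -> str:
    """Send a single prompt to your LLM of choice and return its text reply."""
    raise NotImplementedError("Wire this to your organization's LLM interface")

def generate_vocabulary(product_description: str) -> list[str]:
    # Step 1: extract candidate topics from the source content
    candidates = call_llm(
        "List the distinct subject-matter topics covered in the following "
        f"product description, one topic per line:\n\n{product_description}"
    )

    # Step 2: normalize candidates into preferred-term form
    normalized = call_llm(
        "Rewrite each topic below as a concise noun phrase suitable as a "
        f"taxonomy preferred term, one per line:\n\n{candidates}"
    )

    # Step 3: deduplicate and return a clean list for human review
    terms = [line.strip() for line in normalized.splitlines() if line.strip()]
    return sorted(set(terms))
```

Because each step is its own small prompt, a failure on one product can be isolated and the offending step adjusted without redesigning the whole flow.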
Unintended Results
Even within a single attempt to prompt an LLM for outputs, users can still experience unintended results. And the problem is not always in the prompt itself. Sometimes a formatting error can cause a miscommunication between human and machine.
When using an LLM to map relationships between concepts, it is standard to request the output as a spreadsheet to make uploading it to a taxonomy management system easier. Each system can have its own requirements, so the instinct to upload an example spreadsheet to enhance the prompt is a sensible one. Yet even if a given prompt produces a table in the LLM's UI formatted as one key with multiple values separated by commas, the exported spreadsheet may display only 1:1 relationships. This costs the user extra time, either to fix the issue through further prompting or to resolve it manually in a spreadsheet editor.
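When the flattened output does land in your lap, the manual fix can also be scripted. The sketch below expands a multi-value cell into one relationship per row, the shape many taxonomy management systems expect; the file names and the "concept" and "related" column headers are hypothetical and should be adjusted to match your export.

```python
import csv

# Minimal sketch: expand a comma-separated "related" cell into one
# relationship per row. File and column names are hypothetical.
with open("llm_output.csv", newline="", encoding="utf-8") as src, \
     open("relationships.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["concept", "related_term"])
    for row in reader:
        for value in row["related"].split(","):
            value = value.strip()
            if value:  # skip empty entries left by trailing commas
                writer.writerow([row["concept"], value])
```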
Meta-prompting is often promoted as the optimal strategy for refining LLM inputs, and these authors specifically recommend reverse-prompting. Rather than starting with a prompt, this method has the user identify the desired output first; the LLM then helps produce the input required to generate it. This extra layer of development on LLM inputs helps human users translate their intent into machine-readable instructions.
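In practice, a reverse prompt can be as simple as pasting the output you want and asking the model to draft the prompt that would produce it. The template below is an illustrative sketch, not a fixed recipe.

```python
# Illustrative reverse-prompting template: show the LLM the output you want
# and ask it to draft the prompt that would reliably produce it.
desired_output = """\
concept,related_term
shipping,freight
shipping,logistics
"""

reverse_prompt = (
    "I need output in exactly the format below: a CSV with one relationship "
    "per row. Draft a prompt I could give you that would reliably produce "
    "output in this format from a list of concepts:\n\n"
    f"{desired_output}"
)

print(reverse_prompt)  # paste the drafted prompt back into your workflow
```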
Incorrect Results
LLM consumers should be aware by now that these tools are not always accurate. Sometimes, this can be at the semantic level, such as suggesting an article on ground transportation of goods (“shipping”) should use the “maritime” tag. This can also manifest at the format level, such as asking it to output a table in a .csv file and receiving a blank file in return.
These can be the most difficult results to understand and to mitigate. Controlling your domain, whether by working in an enterprise instance of the tool or by using a feature like Copilot's notebooks, which directs the LLM to reference only a set of files you dictate, can help provide context and disambiguate prompts. This is also one of the most important reasons to involve a domain expert in reviewing the results: they are best positioned to identify what a correct result should look like.
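Outside of a dedicated notebook feature, the same constraint can be approximated by assembling the prompt yourself from a dictated set of reference files. The sketch below assumes two hypothetical reference files and an article to be tagged.

```python
from pathlib import Path

# Sketch of grounding a tagging prompt in a dictated set of reference files.
# File names and prompt wording are hypothetical.
reference_files = [Path("style_guide.md"), Path("existing_taxonomy.csv")]

context = "\n\n".join(
    f"--- {path.name} ---\n{path.read_text(encoding='utf-8')}"
    for path in reference_files
)

article_text = "..."  # replace with the content to be tagged

prompt = (
    "Using ONLY the reference material below, suggest tags for the article "
    "that follows. If no suitable tag exists in the reference material, say "
    "so rather than inventing one.\n\n"
    f"{context}\n\n--- ARTICLE ---\n{article_text}"
)
```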
How to Mitigate
Steps to mitigate these issues include:
- Make sure you have the right expertise prepared to assess results.
- Be prepared to edit any given prompt to improve your results.
- Spend time designing not just the wording of the prompt, but also the logic of the steps you’re taking.
- Don’t be afraid of multi-step prompts.
- Use reverse-prompting to draft inputs.
- Rigorously test prompts before using them at scale (see the sketch after this list).
- Direct the LLM to look at specific content to better control the domain.
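For the testing step, even a lightweight format check can catch most formatting blunders before a prompt is run across hundreds of items. The sketch below assumes a two-column relationship export; the expected column names and the pilot sample are placeholders.

```python
import csv
import io

# Minimal sketch of a pre-scale format check: verify that each pilot output
# parses as a two-column CSV with the expected header. Column names are
# placeholders.
EXPECTED_COLUMNS = ["concept", "related_term"]

def output_is_valid(raw: str) -> bool:
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows or rows[0] != EXPECTED_COLUMNS:
        return False
    return all(len(row) == len(EXPECTED_COLUMNS) for row in rows[1:])

sample_outputs = []  # fill with outputs from a small pilot run
failures = [i for i, raw in enumerate(sample_outputs) if not output_is_valid(raw)]
print(f"{len(failures)} of {len(sample_outputs)} samples failed the format check")
```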
By far, the best mitigation strategy is human-led design, implementation, and supervision of all things LLM, and of AI more broadly. LLMs (especially those licensed and maintained as an enterprise instance) can reduce the time and mental bandwidth required for many onerous taxonomy development tasks, and many organizations actively incentivize their use. However, human users still bear the consequences of the results, whether it is the taxonomist who lets an inaccurate tag slip by or the end user who cannot find the right help article to solve their problem.
This is why these tools should work for you and not the other way around. The best workflow may look quite different from user to user, but fundamentally still needs to be driven by human expertise. How does this match up to your own experience using LLMs?