About the Project
A long-time Factor client, a Fortune 100 technology company, engaged us to design and execute a proof of concept (PoC) to demonstrate auto-tagging of a collection of sales and marketing materials. The PoC consisted of the following steps:
- A review of the current state of auto-tagging technology and commercial offerings.
- Platform selection and implementation.
- Processing and ingestion of a representative test corpus.
- Testing and optimization of auto-tagging using a set of corporate taxonomies.
- Go/no-go decision by the client.
Challenge
Consistent application of metadata to content is an essential part of content management. It’s a requirement for findability, through both search and browsing applications, and it’s a necessary enabling capability for analytics, AI, personalization and targeting, and content governance. Our client faced a number of challenges associated with content tagging:
- In the existing manual tagging workflow, human taggers spent roughly 20 minutes per asset. At that rate, every thousand assets represents about 330 person-hours of work, so manually tagging thousands of sales and marketing documents and web pages was a massive effort.
- The manual workflow had persistent problems with tagging completeness, consistency, and accuracy. Inconsistency between taggers, in particular, was a challenge.
- It was proposed that auto-tagging with human-in-the-loop quality control would improve overall tagging quality and provide a scalable alternative.
Factor’s Approach
The overall goal of the PoC was to answer three questions about auto-tagging:
- Does it work? Can a sample content corpus be tagged using specific taxonomies?
- Does it work well? Is the tagging quality good, can it be improved, and can non-technical users contribute tagging rules?
- Can it be implemented? Does it integrate with existing systems and support current workflows?
The initial phase of the PoC consisted of a technology and vendor review, culminating in the recommendation of an auto-tagging platform. The second phase was the implementation of an auto-tagging test environment and exploration of auto-tagging capabilities.
Over a three- to four-month period, the technology review and platform selection process entailed:
- Initial development of a set of baseline requirements for an auto-tagging solution. Integration with a taxonomy tool, rule-based tagging capabilities, and explainability were among key requirements.
- Vendor and tool review, including documentation review, interview/assessment sessions, and platform demos. A total of eight options were assessed: two custom, in-house applications and six platforms from commercial vendors.
- Selection, for further evaluation, of a vendor platform consisting of a taxonomy and ontology manager with auto-categorization capabilities and a text annotation tool.
Platform selection was followed by another three to four months of implementation and testing:
- Platform setup.
- Selection, processing, and ingestion of test content and taxonomies.
- Initial, naive auto-tagging, with analysis of tagging accuracy and completeness.
- Iterative fine-tuning with tagging rules and taxonomy modifications, with continual quality tracking.
Solution
The PoC test environment and supporting assets comprised:
- A text extraction and upload pipeline built from the Beautiful Soup Python package and additional Python scripts (see the extraction sketch after this list).
- A set of organizational taxonomies describing the client’s products, as well as topical taxonomies for technology, subject, and industry.
- An auto-tagging test environment built from a taxonomy and ontology management platform with auto-categorization capabilities and a text annotator.
- Reporting on raw annotation data and its comparison to ground truth (reference data whose tagging is considered correct); see the metrics sketch after this list.
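To make the extraction step concrete, here is a minimal sketch of a Beautiful Soup pipeline of the kind described above. The directory names and tag filters are assumptions for illustration, and the upload step to the tagging platform is omitted.

```python
# Minimal text extraction sketch; the "corpus" and "extracted" directories
# and the tag filter list are illustrative assumptions, not the client's setup.
from pathlib import Path

from bs4 import BeautifulSoup


def extract_text(html_path: Path) -> str:
    """Strip markup from one HTML asset and return plain text."""
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")

    # Drop non-content elements so navigation and scripts are not tagged.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    # Collapse whitespace so downstream text processing sees clean input.
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    out_dir = Path("extracted")
    out_dir.mkdir(exist_ok=True)
    for path in Path("corpus").glob("*.html"):
        (out_dir / f"{path.stem}.txt").write_text(extract_text(path), encoding="utf-8")
```

A production pipeline would also need to handle whatever other document types and content formats the corpus contains, which is part of why the ingestion step may need significant customization (see What We Learned).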
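The ground truth comparison can be summarized with standard per-document precision (a measure of accuracy) and recall (a measure of completeness). A minimal sketch, with invented sample tags:

```python
# Per-document precision/recall against ground truth; the sample tags
# below are invented for illustration.
def precision_recall(auto: set[str], truth: set[str]) -> tuple[float, float]:
    """Precision: share of auto-applied tags that are correct.
    Recall: share of ground-truth tags the auto-tagger found."""
    hits = len(auto & truth)
    precision = hits / len(auto) if auto else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


auto_tags = {"cloud", "security", "training"}     # platform output
truth_tags = {"cloud", "security", "networking"}  # human-validated tags
p, r = precision_recall(auto_tags, truth_tags)
print(f"precision={p:.2f} recall={r:.2f}")        # precision=0.67 recall=0.67
```

Tracking these two numbers per taxonomy across tuning iterations is one straightforward way to implement the continual quality tracking described above.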
What We Learned
Auto-tagging was shown to work well, and it can be iteratively improved in a test environment; it has potential as a scalable tagging solution. However, there are some important considerations:
- It is not a plug-and-play solution, and organizational readiness is key. A large-scale auto-tagging program requires significant IT resources, supporting taxonomies, and mature content governance.
- The out-of-the-box (OOTB) tagging and analysis experience is adequate for testing but not for general use in the envisioned final workflow.
- Typical enterprise taxonomies are not ideal for auto-tagging: we found that roughly 25% of terms required a rule governing their use. Rules were created with regular expressions, and identifying and troubleshooting tagging errors was time consuming. Homographs were especially troublesome, as were artifacts of text processing steps such as stemming (for example, train matching training, and hospital matching hospitality); see the sketch after this list. Fit-for-purpose auto-tagging vocabularies with large numbers of synonyms, together with tuned stemming algorithms, stop word lists, and other text processing steps, are likely to reduce the need for term-specific rules and produce better quality auto-tagging.
- The document processing and ingestion pipeline needs to be robust and may require significant customization. Document types and content formats need to be considered, and fine-tuning may be necessary.
- Content publishing workflows need to be understood and controlled to determine how and when auto-tagging is applied. Mature content governance is required.
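To make the homograph and stemming issues concrete, the sketch below shows how stemming collapses train/training and hospital/hospitality onto shared stems, and how a word-boundary rule of the kind used in the PoC disambiguates. NLTK's Porter stemmer stands in here for the platform's own (undocumented) text processing; the terms and the rule are illustrative.

```python
# Illustration of two tagging failure modes and a regex-rule fix; the
# Porter stemmer is a stand-in for the platform's actual text processing.
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming collapses distinct taxonomy terms onto a shared stem, so a
# stem-based matcher cannot tell them apart.
for word in ("train", "training", "hospital", "hospitality"):
    print(word, "->", stemmer.stem(word))
# train -> train, training -> train,
# hospital -> hospit, hospitality -> hospit

# A naive substring match fails the same way:
naive_hit = "hospital" in "hospitality management services"  # True: false positive

# A term-specific rule with word boundaries avoids the false positive:
rule = re.compile(r"\bhospital\b", re.IGNORECASE)
precise_hit = bool(rule.search("hospitality management services"))  # False

print(naive_hit, precise_hit)  # True False
```

Writing, testing, and maintaining one such rule for roughly a quarter of all taxonomy terms is what made the tuning effort time consuming.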
End Result
The client was satisfied with the tagging results but realized that integrating auto-tagging into the content creation workflow would require additional integration and engineering resources. The client was also in the process of acquiring a new content pipeline management tool. As a result, they purchased the taxonomy management component of the auto-tagging platform and planned to begin auto-tagging integration after the new tool was implemented.