Architecture Considerations in Internationalized, Automatically Translated Endeca Implementations
Kyle Hipke, Cirrus10 Lead Developer
This document explains the architectural considerations in running an internationalized Endeca site. First, it explains the various data modeling approaches (i.e. how internationalized product and non-product content should be stored and retrieved in Endeca). Second, it discusses how to integrate Endeca with an automatic translation service (like Smartling) in a situation where the merchandising team will only author content in English but the content will be translated into other languages by the service.
Internationalized Endeca Architecture
In an internationalized Endeca implementation, there are two main pieces to think about: product data and non-product content. In Endeca, product data lives in a service called MDEX, which returns product and refinement data in response to search queries. On the other hand, non-product content (like Experience/Rule Manager content, thesaurus entries, and keyword redirects) lives in Workbench. In some situations, some non-product content might also live in an external system (like ATG). For the purposes of this section, though, the presence of a system like ATG does not heavily factor into the decision making process.
Overall, the MDEX architecture is the main decision point. The architecture of Workbench follows from that decision.
MDEX Architecture
There are two main options for deciding how to store all of the product data for all of the various languages you want to support. Each option has its own pros and cons. The best option depends on your requirements and long-term goals.
Option 1 – One MDEX for All Languages
In this option, there’s one MDEX and all properties for all languages are stored on the same record or split across different records. For example:
MDEX Record 1
Another possibility for storing data in this approach is like so:
MDEX Record 1
MDEX Record 2
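As a minimal sketch of the two layouts described for Option 1, the records could be pictured as plain Python dictionaries. The property names (`product.id`, `name_en`, `name_fr`, `language`, `name`) and the sample values are hypothetical and are not taken from a real Endeca schema; the point is only to contrast "all languages on one record" with "one record per language".

```python
# Hypothetical property names and values; not a real Endeca schema.

# Layout A: one record carries the properties for every language.
combined_record = {
    "product.id": "1001",
    "name_en": "Red shoes",
    "name_fr": "Chaussures rouges",
}

# Layout B: one record per language, tagged with a language property.
split_records = [
    {"product.id": "1001", "language": "en", "name": "Red shoes"},
    {"product.id": "1001", "language": "fr", "name": "Chaussures rouges"},
]


def name_from_combined(record, language):
    """Layout A: pick the language-suffixed property."""
    return record[f"name_{language}"]


def name_from_split(records, product_id, language):
    """Layout B: filter the records by product and language."""
    for record in records:
        if record["product.id"] == product_id and record["language"] == language:
            return record["name"]
    raise KeyError((product_id, language))
```

In Layout A every query has to name the right language-specific property, while in Layout B every query has to filter on the language property.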
Option 2 – One MDEX Per Language
In this option, there’s one MDEX for every language you want to support. For example:
MDEX-en Record 1
MDEX-fr Record 2
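With one MDEX per language, the application has to decide which engine to query for each visitor. A minimal sketch of that routing, assuming hypothetical host names and ports and a fallback to English for unsupported languages (neither is prescribed by Endeca), might look like this:

```python
# Hypothetical hosts and ports; a real deployment would read these from
# its own configuration.
MDEX_BY_LANGUAGE = {
    "en": ("mdex-en.example.com", 15000),
    "fr": ("mdex-fr.example.com", 15000),
}
DEFAULT_LANGUAGE = "en"  # assumption: unsupported languages fall back to English


def mdex_for_locale(locale):
    """Return the (host, port) of the MDEX serving the locale's language."""
    # Accept both "fr_CA" and "fr-CA" and keep only the language part.
    language = locale.replace("_", "-").split("-")[0].lower()
    return MDEX_BY_LANGUAGE.get(language, MDEX_BY_LANGUAGE[DEFAULT_LANGUAGE])
```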
Due to Endeca’s linguistic processing capabilities, the major deciding factors are which specific languages you want to support and how many of them there are.
To take advantage of Endeca’s language features, you should go with Option 2 if you plan on supporting certain languages, including Japanese, Korean, Chinese or German. The full list of languages for which you need to use Option 2 can be found in the Oracle Guided Search Internationalization Guide. If you wish to support any language for which “Only Supported Language Analyzer” is “OLT”, you will need to go with Option 2.
Furthermore, if you plan on supporting a large number of languages (“more than 3” being a general rule of thumb), it will tend to be better to use Option 2. If you have a large number of languages in a single MDEX, you may run into issues when managing Rule/Experience Manager content, keyword redirects, thesaurus entries and stop words. Not only is it cumbersome to manage everything in one big list, but you may have a word (in a redirect or thesaurus entry) that means different things in different languages. In German, for example, “Gift” means “poison”, but “gift” in English means “a present”.
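The "Gift" collision can be illustrated with a toy thesaurus. This is a sketch in plain Python, not Workbench's thesaurus format, and the entries and synonyms are illustrative. It contrasts a single shared list, where every matching entry applies to every search, with entries that are only applied for the language they were authored for.

```python
# Each entry: (language it was authored for, term, synonyms). Illustrative data.
ENTRIES = [
    ("en", "gift", ["present"]),
    ("de", "gift", ["toxin"]),
]


def expand_shared(term):
    """Single list for all languages: every matching entry applies."""
    expanded = [term]
    for _language, entry_term, synonyms in ENTRIES:
        if entry_term == term.lower():
            expanded.extend(synonyms)
    return expanded


def expand_per_language(term, language):
    """Separate lists: only entries authored for `language` apply."""
    expanded = [term]
    for entry_language, entry_term, synonyms in ENTRIES:
        if entry_language == language and entry_term == term.lower():
            expanded.extend(synonyms)
    return expanded
```

With the shared list, an English search for "gift" is also expanded with the German synonym; with per-language lists it is not.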
If you only plan on supporting a small number of languages and don’t support any languages for which “OLT” is the only analyzer (see the table in the Internationalization Guide referenced above), Option 1 would tend to be better. With Option 2, you will usually require more hardware to handle the larger number of MDEX engines and to handle load balancing. So, if you can fit everything on a single MDEX, you can avoid that cost.
If you’re planning on using an automatic translation service and saving translated content back to Endeca and/or ATG (which will be discussed later), Option 2 is probably best. This is because, if you have everything in one MDEX, your Experience/Rule Manager or ATG content, redirects, thesaurus entries and stop words may become cluttered and unmanageable as content becomes translated and fills up Workbench or ATG. If you’re only authoring content in English but you start seeing content from other languages alongside your English content, you’ll have a harder time working on and organizing your content.
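The rules of thumb in this section can be collected into a small helper. This is a sketch, not an official sizing tool. The language set contains only the examples named in the text (as assumed ISO 639-1 codes) and is incomplete, the full list being in the Oracle Guided Search Internationalization Guide. The function name and its parameters are hypothetical.

```python
# Languages named in the text as requiring Option 2 (assumed ISO 639-1
# codes). The full list is in the Internationalization Guide; this set is
# deliberately incomplete.
OLT_ONLY_EXAMPLES = {"ja", "ko", "zh", "de"}
MANY_LANGUAGES_THRESHOLD = 3  # the "more than 3" rule of thumb


def recommend_mdex_option(languages, stores_translated_content=False):
    """Return (option, reason) following the rules of thumb in this section."""
    languages = {code.lower() for code in languages}
    olt_only = sorted(languages & OLT_ONLY_EXAMPLES)
    if olt_only:
        return 2, "language analysis requires it for: " + ", ".join(olt_only)
    if len(languages) > MANY_LANGUAGES_THRESHOLD:
        return 2, "too many languages to manage in one application"
    if stores_translated_content:
        return 2, "translated content would clutter a shared application"
    return 1, "few languages; a single MDEX avoids extra hardware"
```

For example, `recommend_mdex_option(["en", "fr"])` suggests Option 1, while adding `"ja"` or setting `stores_translated_content=True` suggests Option 2.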
Endeca Workbench Architecture
If using Option 1 (one MDEX containing all languages), all content will exist under a single application in Workbench. In other words, when you look at the thesaurus screen, you’ll see all entries for all languages in the list. Additionally, all of your Experience/Rule Manager content for all languages will appear side by side in the Experience/Rule Manager editor. However, depending on your needs, it’s possible to organize Experience/Rule Manager content to make it easier to manage. For example, you could arrange it so that content created for separate languages appears in separate folders.
If using Option 2, Workbench is much simpler. Every language will have its own application and all content will be completely separate. To work on content for a specific language, you can simply select the corresponding application in the Workbench application list.
Integrating With An Automatic Translation Service
You can make use of an automatic translation service (like Smartling) so that your merchandising team can create content in English but have that content be translated and served to visitors in other languages. This section discusses the options for using a service like this with Endeca (and, optionally, ATG).
There are two approaches for integrating a translation service with Endeca. They primarily differ in search performance, control/flexibility and cost.
Approach 1 – Global Delivery Network (GDN)
In this approach, the automatic translation service sits between your English site and your visitors. When a visitor attempts to visit a non-English version of your site, the translation service gets the English version of the page, translates it and displays the translated version to the user. When a customer enters a search, the search is translated into English and performed on the English site, then the results page is translated into the visitor’s language.
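The request flow described above can be sketched as follows. The `translate` and `search_english_site` callables are hypothetical stand-ins for the translation service and the English site; they are not real Smartling or Endeca APIs, and the toy dictionary exists only so the sketch runs end to end.

```python
def gdn_search(query, visitor_language, translate, search_english_site):
    """Translate the query to English, search, then translate the results.

    `translate(text, source, target)` and `search_english_site(query)` are
    hypothetical callables supplied by the caller.
    """
    if visitor_language == "en":
        return search_english_site(query)
    english_query = translate(query, visitor_language, "en")
    english_results = search_english_site(english_query)
    return [translate(title, "en", visitor_language) for title in english_results]


# Toy stand-ins so the sketch runs end to end.
_TOY_DICTIONARY = {
    ("chaussures", "fr", "en"): "shoes",
    ("Red shoes", "en", "fr"): "Chaussures rouges",
}


def toy_translate(text, source, target):
    # Unknown text is passed through untranslated.
    return _TOY_DICTIONARY.get((text, source, target), text)


def toy_search(query):
    return ["Red shoes"] if query == "shoes" else []
```

In this toy setup, `gdn_search("chaussures", "fr", toy_translate, toy_search)` finds the product, while a French query the toy translator does not know reaches the English site untranslated and returns nothing.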
This approach has the advantage of being faster and cheaper to implement than the alternative. This is because very little work is needed on the Endeca side to support this. Also, you only need one MDEX. The discussion in the first part of this document doesn’t apply because you’re only going to be running one English MDEX.
The disadvantages of this approach (in comparison with the other approach) have to do with search performance, control and flexibility.
Search relevance (customers being able to find what they’re looking for) is worse because there’s always going to be some amount of error introduced when translating a search to English and translating the results back to the visitor’s language. The actual results list or other Endeca-driven pages might contain errors as well. Since the translated content is not in your own system, you can’t fix translation issues. You have to rely on the service’s translation. Additionally, if you have any products in your catalog that have some data in different languages, the search will miss out on those matches. For example, if some of your products have Chinese in their descriptions, a Chinese visitor searching on something that matches one of those descriptions will not necessarily see the matching product (unless the search they used, when translated to English, matches on some other English data on that product, like if the description is a mix of English and Chinese).
Control is worse because you can only work with the English content. For example, if you noticed that visitors from different locales tended to be more interested in different products, you would not be able to customize the search results page for those different locales. Similarly, if you noticed issues with some of your content or some of your searches for a specific locale, you can’t as easily fix them individually without affecting all the other locales. Overall, you just don’t have the ability to do merchandising on a locale-by-locale basis and you have a reduced ability to fix locale-specific problems as they come up.
Flexibility is worse because none of the translated content is in Endeca, Workbench or any of your own systems (like your product catalog). Basically, you give up the ability to do what you want with the translated content. Even if you don’t care to do locale-specific merchandising now, this approach doesn’t have the flexibility to let you do that in the future. However, given how cheap this approach is to implement, it’s feasible to simply switch to Approach 2 if your needs change later on.
Approach 2 – Translation API
In this approach, all translated content is stored in your own systems. When content is created in your product catalog, ATG, etc., a process runs which makes use of the translation service’s API to translate the content. The translated content is then saved back to the system in which the original content was created, along with the English content.
This approach is significantly more expensive to implement and maintain. It’s also significantly slower to implement. This is because of the requirement to automatically translate and store the translated content. A process needs to be developed for your product catalog and/or any external systems (like ATG) which will query the translation API and save the results. These processes won’t be simple. They will need to have logic to handle conflicts (What if someone has made changes to the Chinese version of some translated content but somebody edits the original English version as well?) and a careful process and careful timing to handle passing data between all the systems involved. Hardware resources need to be allocated to store that translated content. Workbench content doesn’t take up much space, but for a product catalog or external system, it would (in the worst case) effectively multiply the amount of storage required by the number of languages you want to support.
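The conflict case mentioned above (a translation was edited by hand and the English original later changes) can be sketched as follows. The `Translation` fields, the version counter and the status names are assumptions made for illustration, and `translate` is a hypothetical callable wrapping the translation service's API. One possible policy is shown: never overwrite a hand-edited translation, flag it for review instead.

```python
from dataclasses import dataclass


@dataclass
class Translation:
    text: str
    source_version: int            # English version this translation was made from
    manually_edited: bool = False  # True once a person has changed the text


def sync_translation(english_text, english_version, existing, translate):
    """Return (translation, status) for one content item and one language.

    `existing` is the stored Translation, or None if there is none yet.
    """
    if existing is None:
        return Translation(translate(english_text), english_version), "created"
    if existing.source_version == english_version:
        return existing, "up_to_date"
    if existing.manually_edited:
        # English changed after a person fixed the translation: keep the
        # stored text and flag the item for review instead of overwriting.
        return existing, "conflict"
    return Translation(translate(english_text), english_version), "retranslated"
```

Other policies are possible, such as always retranslating and keeping the hand-edited text as a previous revision. Which one fits depends on how much manual correction you expect to do.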
The advantages of this approach are search performance, control and flexibility.
By storing and indexing the translated product information, you can take advantage of Endeca’s language processing features to improve the relevancy of results for non-English searches. This helps customers more easily find what they are looking for because their search is performed directly against the translated product list and may take advantage of Endeca’s linguistic features (like decompounding, stemming, and normalization), unlike the other approach where their search is translated to English and performed against the English product list. Additionally, you can adjust the relevancy logic on a locale-by-locale basis, tuning the search results to better serve visitors from each locale.
Because all of the translated content is stored in your own systems, you can do whatever you want with it. It gives you the flexibility to fine-tune the merchandising and search behavior for each individual locale. That said, whatever you end up wanting to do with all of your translated content, it’s going to take development effort. This approach just makes it more feasible to build locale-specific functionality. But with the Global Delivery Network, those sorts of things aren’t even an option (not without switching approaches).
Furthermore, this approach enables you to fix translated content whenever you notice issues. You could develop a workflow where a reviewer randomly selects from among your most frequently viewed content and checks it for errors, then fixes those errors directly in the translated content.
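The sampling step of such a review workflow could be as simple as the sketch below. It assumes you can obtain a mapping from content ID to view count from your own analytics. The function name, its parameters and the default sizes are hypothetical.

```python
import random


def pick_for_review(view_counts, top_n=100, sample_size=10, rng=None):
    """Randomly sample content IDs from the `top_n` most viewed items.

    `view_counts` maps a content ID to its view count.
    """
    rng = rng or random.Random()
    most_viewed = sorted(view_counts, key=view_counts.get, reverse=True)[:top_n]
    # Never ask for more items than are available.
    return rng.sample(most_viewed, min(sample_size, len(most_viewed)))
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible, which can help when auditing which items were reviewed.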
With this approach, you will still need to decide between the MDEX architecture options discussed earlier in the document.
The first thing to consider is your requirements. What do you currently, absolutely need out of this solution?
If you know that, right now, you absolutely need the ability to do merchandising and search tuning on a locale-by-locale basis, then the only way you can have that is using the Translation API approach.
If you want the search relevancy to be as high as possible, the Translation API approach achieves that. However, it’s important to consider how much better the search relevancy would be compared with the GDN. We can’t know the exact difference without A/B testing, but because the translation tool will make translation errors regardless of the approach, we suspect the difference in search relevance between the approaches is not necessarily significant. In the GDN approach, errors may be introduced when translating a customer’s search terms and translating the results list page. In the API approach, errors may be introduced when translating the product catalog. Take this into consideration and think about your customers’ tendencies, your most popular search terms and how error-prone translations of those searches might be. In sum, if you want the best possible search relevancy regardless of the tradeoffs, the Translation API approach works. But, if you’re conscious of the tradeoffs, the GDN approach may not be much worse.
If you don’t absolutely need any of the above-mentioned things, then you open up the possibility of using the GDN approach. Since the impact on search relevance will vary depending on your locales, customers, and the specifics of your site, there’s no way to know for sure how much better or worse the various approaches will be unless you try them out (though we suspect, for most cases, it won’t be a very large difference). Thankfully, it’s feasible to try the cheaper GDN approach first but leave the option to go with the more expensive API approach open, should you discover GDN isn’t good enough for your needs.
Consider the fact that the GDN approach requires such a small amount of time and effort. If you don’t, right now, absolutely need certain features exclusive to the API approach, it will probably make the most sense if you go with the GDN approach first. If, later, you desire the ability to do locale-specific merchandising and search tuning or you aren’t satisfied with the search relevance for non-English locales, you can consider switching to the API approach. Because the GDN approach is so cheap relative to the API approach, you don’t lose much by trying it out first, even if you eventually decide to switch to the API approach. Furthermore, you can use the information and experience you’ve gained in using the GDN approach to inform the design, development and maintenance of the API approach.