TM Manager
TM Manager is a tool for cleaning TMX files by removing damaged, deprecated, or noisy TUs (Translation Units).
Cleaning a TM
1. Go to TM Manager on the Console. The default starting page will open.
A basic configuration of preset filters will be available to start with.
Two types of subsets can be extracted from the original TMX:
Sample size: type the number of segments that will be copied from the original TMX file into the cleaned file.
Test set size: type the number of segments that will be chosen randomly and extracted from the TMX file. The final, cleaned TMX will no longer include these TUs. This test set can be used later for automatic score calculation.
Both settings are optional.
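The interaction between the two settings can be sketched as follows. The function name and data shapes here are hypothetical, not the tool's actual API: the sample is copied (its TUs stay in the cleaned file), while the test set is carved out (its TUs are removed).

```python
import random

def split_tm(tus, sample_size=None, test_set_size=None, seed=0):
    """Hypothetical sketch of the two subset settings (not the tool's API).

    tus: list of (source, target) pairs.
    The sample is *copied* (the TUs stay in the cleaned file); the test
    set is *extracted* (its TUs are removed from the cleaned file).
    """
    rng = random.Random(seed)
    sample = list(tus[:sample_size]) if sample_size else []
    remaining = list(tus)
    test_set = []
    if test_set_size:
        k = min(test_set_size, len(remaining))
        picked = set(rng.sample(range(len(remaining)), k))
        test_set = [remaining[i] for i in sorted(picked)]
        remaining = [tu for i, tu in enumerate(remaining) if i not in picked]
    return sample, test_set, remaining
```

Both arguments default to None, mirroring the fact that both settings are optional.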
Click on Filters to display all available rules for cleaning the TMX.
Choose the filters to apply by selecting the corresponding checkboxes (see TM Manager | Filters for a detailed description of the available filters).
If you have selected a filter configuration that you want to reuse for future cleanings, you can save the selection by clicking on Save new preset.
Give your configuration a name, and a new button with your saved settings will appear in the Preset section.
Upload the file to be cleaned by clicking the Upload files button.
In the Upload window:
select source and target language
upload the tmx file to be cleaned
(Optional) upload a reference glossary
Click the Start button to begin the cleaning.
After the process finishes, a notification will be shown with a link to download the files, and you will also receive an email with a download link.
Cleaning Results
The cleaning results will be provided as a zip file containing:
clean.tmx: the cleaned TMX, i.e. a new TMX that no longer contains the removed TUs
duplicates.txt: a list of the duplicates that were deleted from the final cleaned TMX
filtered.xlsx: a detailed, segment-by-segment list of all TUs that were deleted from the final TMX and the reason why (i.e. the name of the filter that triggered the removal)
report.yaml: a numeric analysis of the status of the TM, detailing how many TUs were deleted and why (i.e. the name of the filter that triggered the removal, with a count of segments removed for each reason)
sample.tmx: a file with the copied segments (see Sample size above)
test_set.tmx: a file with the subset of TUs reserved for automatic scoring (see Test set size above)
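As a quick sanity check, the downloaded archive can be inspected with the standard library. The file names below come from the list above; the contents are invented stand-ins for illustration:

```python
import io
import zipfile

# Build a stand-in results archive (file names from the docs; contents invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("clean.tmx", "<tmx version=\"1.4\"></tmx>")
    zf.writestr("report.yaml", "no_empty: 12\ndedup: 30\n")

# Inspect it the way you might inspect a downloaded result:
# list the members and read the per-filter removal counts.
with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
    removed_per_filter = dict(
        line.split(": ") for line in zf.read("report.yaml").decode().splitlines()
    )
```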
Filters
Filters remove TUs from the original TMX file. The following section provides a detailed description of the elements that each filter will remove.
Length filtering:
no_empty: segments with empty source or empty target
not_too_long: segments containing more than 512 characters
not_too_short: segments containing fewer than 3 words
length_ratio: segments in which one side is 70% longer than the other
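A stdlib-only approximation of these four rules. The thresholds come from the list above; the exact counting method (whitespace tokenization, character-based ratio) is an assumption:

```python
def length_filter(src, tgt, max_chars=512, min_words=3, ratio=0.7):
    """Sketch of the length rules; returns the name of the first
    filter that fires, or None if the TU passes all of them."""
    if not src.strip() or not tgt.strip():
        return "no_empty"
    if len(src) > max_chars or len(tgt) > max_chars:
        return "not_too_long"
    if len(src.split()) < min_words or len(tgt.split()) < min_words:
        return "not_too_short"
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if (longer - shorter) / shorter > ratio:  # one side 70% longer
        return "length_ratio"
    return None
```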
Noise Filtering:
no_identical: segments with target same as source
no_literals: segments containing any of the following: "Re:", "{{", "%s", "}}", "+++", "***", "=\"
no_breadcrumbs: segments containing more than 2 breadcrumb characters
no_repeated_words: segments containing consecutive repeated words
no_unicode_noise: segments containing too many characters from weird Unicode sets
no_space_noise: segments containing too many consecutive spaces
no_paren: segments containing too many parentheses or brackets
no_escaped_unicode: segments containing escaped Unicode
no_bad_encoding: segments containing mojibake
no_wrong_language: segments whose source or target are not in the expected language
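Several of these rules lend themselves to regular expressions. This sketch approximates three of them; the "too many" threshold for spaces is chosen arbitrarily for illustration, not taken from the tool:

```python
import re

# \b(\w+)\s+\1\b matches a word immediately repeated ("the the").
REPEATED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
# Literal markers from the no_literals description above (subset).
LITERALS = ("Re:", "{{", "%s", "}}", "+++", "***")

def noise_filter(segment):
    """Sketch of three noise rules; returns the first filter that
    fires, or None if the segment looks clean."""
    if REPEATED.search(segment):
        return "no_repeated_words"
    if any(lit in segment for lit in LITERALS):
        return "no_literals"
    if re.search(r" {3,}", segment):  # 3+ spaces = assumed threshold
        return "no_space_noise"
    return None
```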
Miscellaneous:
no_only_numbers: segments in which more than 50% of the characters are numbers
no_urls: segments containing URLs
no_titles: segments in which all words are uppercase or title-cased
no_only_symbols: segments that consist of more than 90% non-alphabetic characters
no_script_inconsistencies_sl: segments whose source contains characters from different scripts / writing systems
no_script_inconsistencies_tl: segments whose target contains characters from different scripts / writing systems
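A sketch of three of the miscellaneous rules. The 50% threshold and the uppercase/title-case logic follow the descriptions above; the counting details (ignoring whitespace, simple URL regex) are assumptions:

```python
import re

def misc_filter(segment):
    """Sketch of three miscellaneous rules; returns the first filter
    that fires, or None."""
    if re.search(r"https?://\S+", segment):
        return "no_urls"
    chars = [c for c in segment if not c.isspace()]
    if chars and sum(c.isdigit() for c in chars) / len(chars) > 0.5:
        return "no_only_numbers"
    words = segment.split()
    if words and all(w.isupper() or w.istitle() for w in words):
        return "no_titles"
    return None
```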
Content and security filtering:
no_explicit_text: segments containing porn-like language (the filter is trained on porn video content)
sensitive_data: segments containing data similar to personal data
Quality filtering:
inc_transl_sl: segments containing translation inconsistencies in source
inc_transl_tl: segments containing translation inconsistencies in target
no_number_inconsistencies: segments with mismatching numbers
dedup: segments that are duplicates or near-duplicates of another TU
no_glued_words: segments containing too many upper-cased characters between lower-cased characters in source or target
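The idea behind no_number_inconsistencies can be illustrated by comparing the multisets of numbers found on each side; the real filter's comparison may be more lenient about number formatting across locales:

```python
import re

def number_mismatch(src, tgt):
    """Sketch of no_number_inconsistencies: True if the numbers in
    source and target do not match (comparison method assumed)."""
    def nums(text):
        # Integers and simple decimals with . or , separators.
        return sorted(re.findall(r"\d+(?:[.,]\d+)?", text))
    return nums(src) != nums(tgt)
```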
Neural Filters:
lm_filter: segments with a low fluency score from a language model. It considers the entire segment's fluency, grammar, etc. Use with care on short segments
bifixer: fixes mojibake, turns HTML entities into the characters they represent, replaces characters from wrong alphabets with the correct ones, normalizes punctuation and spaces, and fixes common orthographic errors for supported languages
bicleaner: checks alignment by calculating a mutual translation probability score based on parallel corpus data, detecting noisy sentence pairs in the parallel corpus
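bifixer itself is a dedicated tool, but two of the fixes described above, decoding HTML entities and normalizing punctuation and spaces, can be illustrated with the Python standard library (this is an approximation, not bifixer's implementation):

```python
import html
import unicodedata

def tiny_bifix(segment):
    """Stdlib illustration of two bifixer-style fixes (not bifixer)."""
    segment = html.unescape(segment)  # "&eacute;" -> "é", "&amp;" -> "&"
    # NFKC folds compatibility characters, e.g. no-break space -> space.
    return unicodedata.normalize("NFKC", segment)
```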