TM Manager

TM Manager is a tool devised to clean TMX files by removing damaged, deprecated, or noisy TUs (Translation Units).

Cleaning a TM

 1. Go to TM Manager on the Console. A default starting page will open.

image-20250724-131857.png
  2. A basic configuration of preset filters will be available to start with.

    image-20250805-073256.png
  3. It is possible to cut two types of subsets from the original TMX:

    1. Sample size: type the number of segments that will be copied from the original TMX file into the cleaned file.

    2. Test set size: type the number of segments that will be chosen randomly and extracted from the TMX file. The final, cleaned TMX will no longer include these TUs. This set can be used later for automatic score calculation.

Both settings are optional.
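As a rough sketch of what these two optional settings do (the function and variable names here are hypothetical, and the real tool operates on TMX internally):

```python
import random

def split_tm(tus, sample_size=0, test_set_size=0, seed=42):
    """Illustrative sketch only: `tus` is a list of translation units.

    The sample is *copied* (those TUs also stay in the cleaned TM), while
    the test set is chosen randomly and *removed* from the cleaned TM, so
    it can later be used for automatic score calculation.
    """
    rng = random.Random(seed)
    sample = list(tus[:sample_size])                 # copied, not removed
    test_set = rng.sample(list(tus), test_set_size)  # chosen randomly
    removed = set(test_set)
    cleaned = [tu for tu in tus if tu not in removed]
    return cleaned, sample, test_set
```

Note that the sketch assumes TUs are unique and hashable; it is meant only to show that the sample is non-destructive while the test set is extracted.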

  4. Click on Filters to display all available rules for cleaning the TMX.

    image-20250805-072546.png

  5. Choose the filters to apply by ticking the relevant checkboxes (see TM Manager | Filters for a detailed description of the available filters).

image-20250805-073328.png

If you want to reuse a filter configuration for future cleanings, you can save the selection by clicking Save new preset.

image-20250805-073615.png

Give your configuration a name, and a new button with your saved settings will appear in the Preset section.

image-20250805-073721.png

  6. Upload the file to be cleaned by clicking the Upload files button.

image-20250805-073437.png
  7. In the Upload window:

    1. select the source and target languages

    2. upload the TMX file to be cleaned

    3. (Optional) upload a reference glossary

image-20250805-073900.png
  8. Click the Start button to begin the cleaning.

  9. After the process finishes, a notification will be shown to download the files, and you will also receive an email with a link to download them.

image-20250724-132013.png

Cleaning Results

The cleaning results will be provided as a zip file containing:

  • clean.tmx: the cleaned TMX, i.e. a new TMX that no longer contains the removed TUs

  • duplicates.txt: a file containing a list of duplicates that were deleted from the final cleaned TMX

  • filtered.xlsx: a detailed list of all TUs that were deleted from the final TMX and the reason why (i.e. the name of the filter that triggered the removal, segment by segment)

  • report.yaml: a numeric analysis of the status of the TM, detailing how many TUs were deleted and why (i.e. the name of the filter that triggered the removal, counting how many segments were removed for each reason)

  • sample.tmx: a file with the copied segments (see step 3)

  • test_set.tmx: a file with the subset of TUs for automatic scoring (see step 3)
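The downloaded archive can be sanity-checked programmatically. Below is a minimal sketch, assuming the zip layout listed above (the archive path and function names are hypothetical):

```python
import zipfile

# Files the cleaning job is expected to produce, per the list above.
EXPECTED = {"clean.tmx", "duplicates.txt", "filtered.xlsx",
            "report.yaml", "sample.tmx", "test_set.tmx"}

def missing_files(names):
    """Return the expected result files absent from `names`, sorted."""
    return sorted(EXPECTED - set(names))

def check_archive(zip_path):
    """Open a downloaded result archive and report any missing files."""
    with zipfile.ZipFile(zip_path) as zf:
        return missing_files(zf.namelist())
```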

 

Filters

Filters remove TUs from the original TMX file. The following section provides a detailed description of the elements that each filter will remove.
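The removal-and-reporting behaviour described above (each deleted TU recorded together with the name of the filter that triggered it, as in filtered.xlsx and report.yaml) can be sketched as a simple predicate pipeline. This is an illustration of the concept, not the tool's actual implementation:

```python
def run_filters(tus, filters):
    """Apply named filters in order; a TU is removed by the first filter
    whose predicate returns True, and the filter name is recorded as the
    removal reason. Illustrative sketch only.

    `filters` is a list of (name, predicate) pairs; each predicate takes
    (source, target) and returns True when the TU should be removed.
    """
    kept, removed = [], []
    for src, tgt in tus:
        for name, pred in filters:
            if pred(src, tgt):
                removed.append((src, tgt, name))   # reason = filter name
                break
        else:
            kept.append((src, tgt))
    return kept, removed
```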

  • Length filtering:

    • no_empty: segments with empty source or empty target

    • not_too_long: segments containing more than 512 characters

    • not_too_short: segments containing fewer than 3 words

    • length_ratio: segments in which one side is more than 70% longer than the other

  • Noise Filtering:

    • no_identical: segments with target same as source

    • no_literals: segments containing any of the following: "Re:", "{{", "%s", "}}", "+++", "***", "=\"

    • no_breadcrumbs: segments containing more than 2 breadcrumb characters

    • no_repeated_words: segments containing consecutive repeated words

    • no_unicode_noise: segments containing too many characters from unusual Unicode ranges

    • no_space_noise: segments containing too many consecutive spaces

    • no_paren: segments containing too many parentheses or brackets

    • no_escaped_unicode: segments containing escaped Unicode

    • no_bad_encoding: segments containing mojibake

    • no_wrong_language: segments whose source or target are not in the expected language

  • Miscellaneous:

    • no_only_numbers: segments in which more than 50% of the characters are numbers

    • no_urls: segments containing URLs

    • no_titles: segments in which all words are uppercased or title-cased

    • no_only_symbols: segments consisting of more than 90% non-alphabetic characters

    • no_script_inconsistencies_sl: segments whose source contains characters from different scripts/writing systems

    • no_script_inconsistencies_tl: segments whose target contains characters from different scripts/writing systems

  • Content and security filtering:

    • no_explicit_text: segments containing porn-like language. Trained on porn video content

    • sensitive_data: segments containing data similar to personal data

  • Quality filtering:

    • inc_transl_sl: segments containing translation inconsistencies in source

    • inc_transl_tl: segments containing translation inconsistencies in target

    • no_number_inconsistencies: segments with mismatching numbers

    • dedup: segments that are duplicates or near duplicates of other TUs

    • no_glued_words: segments containing too many upper-cased characters between lower-cased characters in source or target

  • Neural Filters:

    • lm_filter: segments with a low fluency score from a language model. Considers the entire segment's fluency, grammar, etc. Use with care on short segments

    • bifixer: fixes mojibake, turns HTML entities into the characters they represent, replaces characters from wrong alphabets with the correct ones, normalizes punctuation and spaces, and fixes common orthographic errors for supported languages

    • bicleaner: checks alignment by calculating a mutual translation probability score based on parallel corpus data, detecting noisy sentence pairs in the parallel corpus
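Several of the simpler rules above can be approximated in a few lines each. The thresholds follow the documented values where the text gives them (512 characters, 3 words, 70%); everything else, including the script-detection heuristic, is an assumption made for illustration. Each predicate returns True when the segment violates the rule:

```python
import re
import unicodedata

def no_empty(src, tgt):
    """Empty source or empty target."""
    return not src.strip() or not tgt.strip()

def not_too_long(src, tgt, max_chars=512):
    """More than 512 characters on either side."""
    return len(src) > max_chars or len(tgt) > max_chars

def not_too_short(src, tgt, min_words=3):
    """Fewer than 3 words on either side."""
    return len(src.split()) < min_words or len(tgt.split()) < min_words

def length_ratio(src, tgt, max_ratio=1.7):
    """One side more than 70% longer than the other."""
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return shorter > 0 and longer > max_ratio * shorter

def no_identical(src, tgt):
    """Target same as source."""
    return src.strip() == tgt.strip()

def no_number_inconsistencies(src, tgt):
    """Mismatching digit sequences between source and target."""
    nums = lambda s: sorted(re.findall(r"\d+", s))
    return nums(src) != nums(tgt)

def script_inconsistent(text):
    """Rough sketch of no_script_inconsistencies_*: the first word of the
    Unicode character name approximates the script (LATIN, CYRILLIC, ...)."""
    scripts = {unicodedata.name(ch, "?").split(" ", 1)[0]
               for ch in text if ch.isalpha()}
    return len(scripts) > 1
```

For example, `script_inconsistent` flags a Latin word containing a look-alike Cyrillic letter, a common source of noise in scraped corpora.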