Data conversion for technical publications in XML and SGML


Data conversion services

There are many reasons you might want to convert data from the system it is currently in to something else: a customer requirement, a wish to make better use of your content, the need to become standards-compliant, the need to re-use or re-purpose your data, and many more.

However, is it cost effective? How easy or difficult will it be? Will I need to start from a certain format? Will I lose any of my content or metadata? What can you do to help?

What will it cost me to convert my data?

We have a saying in Great Britain – “How long is a piece of string?” It is meant to emphasise that you don’t always know all the answers at the beginning. Every data conversion project is a bit like this. There are many questions that need answering before being able to give a definitive answer on whether converting your data will be cost effective.

There are lots of variables you need to consider: what form is my data in to start with; how consistent is it; what does it need to become; how soon do I need it converted; how much resource do I already have to do some of the work?

We approach every conversion project in a consistent manner so you can be sure you have all the information to make a decision about the best way forward. Depending on the quantity of data and the time available, your best option may be to re-write the data from scratch, or copy and paste it into the new format. However, if you need the data converted quickly then you will need some sort of processing intervention.

By looking at all these factors, you should be able to measure how long your piece of string is.


Typical phases in a content migration project

These phases are typically undertaken iteratively rather than sequentially. Only the final phase, the migration itself, is undertaken after all the other steps have been completed.

Phase 1: Analyse source content and build information model

The first phase involves Contiem specialists building an information model with your team. This enables clear target structure mappings to be defined between your existing content and the equivalent target structures.

From the information model, we help you develop a definitive set of test content. The aim is to produce a representative sample that includes all relevant mapped structures, i.e. every topic/section type, element and image format, and the different sequences of those items wherever the sequence could make a difference to the output. This makes repeatable, consistent testing possible for the migration script(s) developed later in the conversion process.

We produce mocked-up and marked-up conversion output, as well as a complete specification of the mapping between the legacy content structures and the new structures.
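As a minimal sketch of how a mapping specification and a test sample might be checked against each other, the Python below records a mapping as plain data and reports gaps in both directions. The element names, the `MAPPING` table and both function names are hypothetical illustrations, not taken from a real project or from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping specification: legacy element -> target element.
# The names are illustrative assumptions only.
MAPPING = {
    "section": "dmodule",
    "para": "para",
    "itemizedlist": "randomList",
    "figure": "figure",
}


def unmapped_elements(xml_text, mapping):
    """Return legacy element names found in the sample that have no mapping."""
    found = {el.tag for el in ET.fromstring(xml_text).iter()}
    return sorted(found - set(mapping))


def untested_mappings(xml_text, mapping):
    """Return mapped element names that the sample never exercises."""
    found = {el.tag for el in ET.fromstring(xml_text).iter()}
    return sorted(set(mapping) - found)


SAMPLE = "<section><title>Pump</title><para>Text</para><figure/></section>"

print(unmapped_elements(SAMPLE, MAPPING))   # structures the specification misses
print(untested_mappings(SAMPLE, MAPPING))   # mappings the sample does not cover
```

A check like this only covers element names. Sequences of elements, image formats and metadata would need their own checks.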

Phase 2: Add contextual cues where required and clean up legacy content

Where the source content is unstructured, or a legacy schema is less constrained than the target schema for migration, it is sometimes helpful to add extra semantic information and clean up the legacy content, so it maps more easily and systematically to the target structures.

A simple example is that if you wish to convert a DocBook section to several S1000D data modules, you would first open out the DocBook section into several nested section files, structure them in a way that corresponds to S1000D tasks, and then name them with a “d-” prefix so that the conversion script could take the appropriate action to convert them to data modules.
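The naming cue in this example could be picked up by a script along the following lines. This is only a sketch: the `plan_conversion` function and the file names are hypothetical, and the "d-" prefix is the only detail taken from the example above.

```python
from pathlib import PurePath

DATA_MODULE_PREFIX = "d-"  # the naming cue from the example above


def plan_conversion(filenames):
    """Split file names into those flagged for data module conversion and the rest."""
    to_data_modules, other = [], []
    for name in filenames:
        # Only the file name itself is checked, not the folders above it.
        if PurePath(name).name.startswith(DATA_MODULE_PREFIX):
            to_data_modules.append(name)
        else:
            other.append(name)
    return to_data_modules, other


# Hypothetical file names, for illustration only.
flagged, rest = plan_conversion(
    ["sections/d-remove-pump.xml", "sections/intro.xml", "d-install-pump.xml"]
)
print(flagged)
print(rest)
```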

This process can only be undertaken once a draft information model has been defined and once authors have access to the legacy content to be able to make such changes.

Phase 3: Develop and test migration scripts

The third phase of the conversion process involves developing a customized, often XSLT-based migration script, using the mappings and information model developed in the earlier phases. This is a highly iterative process, and the script will make use of the previously defined sample test input, along with the sample marked-up output for comparison. The expected output may be the same dedicated S1000D maps that were used to test publishing, in which case the input will be the legacy structures that are expected to be mapped to those target data modules, including any necessary extra contextual cues.
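The comparison between script output and the sample marked-up output can be automated. The sketch below assumes Python 3.8 or later and uses the standard library's `xml.etree.ElementTree.canonicalize` so that attribute order and indentation do not cause false failures. The names `same_xml`, `failing_samples` and `toy_convert` are hypothetical, and `toy_convert` merely stands in for the real migration script.

```python
import xml.etree.ElementTree as ET


def same_xml(actual, expected):
    """Compare two XML strings in canonical form, ignoring surrounding whitespace in text."""
    return (ET.canonicalize(actual, strip_text=True)
            == ET.canonicalize(expected, strip_text=True))


def failing_samples(convert, samples):
    """Return the names of samples whose converted output differs from the expected markup."""
    return [name for name, (source, expected) in samples.items()
            if not same_xml(convert(source), expected)]


def toy_convert(source):
    """Stand-in for the real migration script (an assumption for this sketch)."""
    return source.replace("section", "dmodule")


# Hypothetical test content: name -> (legacy input, expected target output).
SAMPLES = {
    "basic": ("<section><para>Hi</para></section>",
              "<dmodule>\n  <para>Hi</para>\n</dmodule>"),
    "needs-work": ("<section/>", "<dmodule><para/></dmodule>"),
}

print(failing_samples(toy_convert, SAMPLES))
```

Re-running a check like this after each change to the script gives a quick view of which sample structures still need attention.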

When Contiem develops these scripts for our clients, using the initial content sample helps to get the development process started, but it is vital to also test the process with real content. There will always be new structures or sequences that have not previously been accounted for, and the script will need to be adjusted to reflect this. This kind of custom script development leads to a high-quality conversion that preserves both the semantics of the original content and the reuse within that content. A good conversion script will convert each reused topic only once, and will map legacy reuse patterns to idiomatic target architectural patterns.
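One way to make sure a reused topic is converted only once is to remember what has already been converted and hand back the same result on later requests. The sketch below assumes each legacy topic has a stable ID. The `convert_all` function, the document names and the topic IDs are all hypothetical, and `convert_topic` stands in for the real per-topic conversion.

```python
def convert_all(documents, convert_topic):
    """Convert every topic referenced by the documents, each topic at most once.

    documents maps a document name to the list of legacy topic IDs it uses.
    Returns (document name -> list of converted topics, topic ID -> converted topic).
    """
    converted = {}  # legacy topic ID -> converted result
    result = {}
    for doc, topic_ids in documents.items():
        refs = []
        for topic_id in topic_ids:
            if topic_id not in converted:
                converted[topic_id] = convert_topic(topic_id)
            # Reused topics point at the single converted copy.
            refs.append(converted[topic_id])
        result[doc] = refs
    return result, converted


calls = []  # records how often the stand-in converter runs


def fake_convert(topic_id):
    """Stand-in converter (an assumption for this sketch)."""
    calls.append(topic_id)
    return "dm-" + topic_id


docs = {"manual-a": ["safety", "remove-pump"], "manual-b": ["safety", "fit-pump"]}
output, seen = convert_all(docs, fake_convert)
print(output)
print(calls)
```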

Phase 4: Schedule conversion according to clusters of reuse

It is rare for an organization to be able to convert all content at the same time, because it is unlikely that every group of authors will be ready to move to the new architecture together. However, to maximise reuse and simplify script development, it is helpful to convert content in batches according to the clusters of reuse within that content. For example, it might make sense to convert all maintenance manuals at one time, due to some reuse between them, and convert operator information or parts data in subsequent batches. It is possible to migrate individual documents from a set like this, and others from the same set later, but that requires developing a more complex conversion tool that can store the paths or IDs of files already converted. With this approach, of course, there is always a risk that reused content may be modified between the first import and a subsequent one.
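Clusters of reuse can be found by treating documents as linked whenever they share a topic, then grouping everything that is connected. The sketch below shows one way to do that, assuming a simple inventory of which topic IDs each document uses. The `reuse_clusters` function, the document names and the topic IDs are hypothetical.

```python
from collections import defaultdict


def reuse_clusters(documents):
    """Group documents into batches; documents that share any topic land in the same batch.

    documents maps a document name to the list of topic IDs it uses.
    """
    # Index each topic by the documents that use it.
    users = defaultdict(set)
    for doc, topics in documents.items():
        for topic in topics:
            users[topic].add(doc)

    seen = set()
    clusters = []
    for start in documents:
        if start in seen:
            continue
        # Walk outwards from this document through shared topics.
        cluster, stack = set(), [start]
        while stack:
            doc = stack.pop()
            if doc in cluster:
                continue
            cluster.add(doc)
            for topic in documents[doc]:
                stack.extend(users[topic] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical inventory, for illustration only.
inventory = {
    "maintenance-a": ["t1", "t2"],
    "maintenance-b": ["t2", "t3"],
    "operator": ["t4"],
    "parts": ["t5"],
}
print(reuse_clusters(inventory))
```

Each resulting batch can then be scheduled as a unit, which avoids having to track already-converted files between imports.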

Major target structure mappings

Map structure

Structuring publications consistently so that all output requirements can be handled and other authors can reuse the content efficiently.

Topic / section types

Specify types of information chunk to provide consistent, clear information for users as well as easier processing and management.

Element usage

Define which elements will be used within sections, in a consistent and easy-to-process way.

Object names and metadata

Name and tag files/database objects appropriately.

Image formats and usage

Specify the file formats and expected dimensions of images.

Reusing content

Define target structures for reusable modules and snippets. Priorities are consistency, accuracy, and maintainability.

Word to XML

Microsoft Word to XML is one of the most common, yet most problematic, conversions. Word content is almost always customised by the end user beyond the use of styles and templates, to achieve the required look and feel of the final document. Handling this requires special skills, covering not only the style formatting but also any additional visual elements applied to the text.

We have a specialist resource for converting Word content.
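To give a sense of what customisation beyond styles looks like in practice, the sketch below scans the main document part of a .docx file (the `word/document.xml` entry inside the zip package) and lists text runs that carry formatting applied directly rather than through a named character style. It is a rough audit under stated assumptions, not a conversion tool: the `direct_formatting_runs` function is hypothetical, and it treats any run property other than `w:rStyle` as direct formatting.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def direct_formatting_runs(document_xml):
    """Return the text of runs whose properties include more than a named character style."""
    root = ET.fromstring(document_xml)
    flagged = []
    for run in root.iter(f"{{{W}}}r"):
        props = run.find("w:rPr", NS)
        if props is None:
            continue
        # Any property other than w:rStyle was applied directly to the run.
        if any(child.tag != f"{{{W}}}rStyle" for child in props):
            flagged.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return flagged
```

The XML string can be read from a .docx file with the standard library's `zipfile` module. A long list of flagged runs suggests that more manual mapping decisions will be needed.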

Further information

Please contact us using the form at the bottom of this page or via the contact us page.

Contact us now

Fill in the form below with a brief message about your requirement and one of our team will be in touch to help.
