The following explanations of the internals of recode should
help people who want to dive into the recode sources to add new
charsets. Adding new charsets does not require much knowledge about
the overall organization of recode. You can rather concentrate
on your new charset, letting the remainder of the recode
mechanics take care of interconnecting it with all the other charsets.
If you intend to play seriously at modifying recode, beware
that you may need some other GNU tools which were not required when
you first installed recode. If you modify or create any
`.l' file, then you need flex, and some better awk
like mawk, GNU awk, or nawk. If you modify
the documentation (and you should!), you need GNU makeinfo.
If you are really audacious, you may also want Perl for modifying the
RFC 1345 processing, and GNU m4 and GNU Autoconf for adjusting
configuration matters.
The recode mechanics slowly evolved over many years, and it
would be tedious to explain all the problems I met and the mistakes I
made along the way, yielding the current behavior. Surely, one of the
key choices was to stop trying to do all conversions in memory, one
line or one buffer at a time. It is far more fruitful to use the
character stream paradigm, and the elementary recoding steps now
convert a whole stream to another. Most of the control complexity in
recode exists so that each elementary recoding step stays simple,
making it easier to add new ones. The whole point of recode, as I see
it, is
providing a comfortable nest for growing new charset conversions.
The main recode driver constructs, while initializing all
conversion modules, a table giving all the conversion routines
available (single steps) and for each, the starting charset and
the ending charset. If we consider these charsets as being the nodes
of a directed graph, each single step may be considered as an oriented
arc from one node to the other. A cost is attributed to each arc:
for example, a high penalty is given to single steps which are prone
to losing characters, a lower penalty is given to those which need
studying more than one input character for producing an output
character, etc.
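This graph-with-costs view can be illustrated in isolation. The sketch
below is not recode's actual data structure; struct arc, MAX_CHARSETS
and cheapest_route are names invented for the example, which merely
shows single steps as weighted arcs searched for a cheapest route:

#include <limits.h>

#define MAX_CHARSETS 16

/* One single step, seen as a weighted arc of the conversion graph.  */
struct arc
  {
    int from;                   /* index of the starting charset */
    int to;                     /* index of the ending charset */
    int cost;                   /* penalty attributed to this single step */
  };

/* Return the cheapest total cost for going from START to GOAL through
   the COUNT arcs in ARCS, or INT_MAX when no route exists.  */
static int
cheapest_route (const struct arc *arcs, int count, int start, int goal)
{
  int best[MAX_CHARSETS];
  int node, pass, i;

  for (node = 0; node < MAX_CHARSETS; node++)
    best[node] = INT_MAX;
  best[start] = 0;

  /* Relax every arc repeatedly, Bellman-Ford style.  */
  for (pass = 0; pass < MAX_CHARSETS; pass++)
    for (i = 0; i < count; i++)
      if (best[arcs[i].from] != INT_MAX
          && best[arcs[i].from] + arcs[i].cost < best[arcs[i].to])
        best[arcs[i].to] = best[arcs[i].from] + arcs[i].cost;

  return best[goal];
}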
Given a starting code and a goal code, recode computes the most
economical route through the elementary recodings, that is, the best
sequence of conversions that will transform the input charset into the
final charset. To speed up execution, recode looks for
subsequences of conversions which are simple enough to be merged, and
then dynamically creates new single steps to represent these mergings.
For example, suppose that four elementary steps were selected at path
optimization time. Then recode will split itself into four
different tasks interconnected with pipes, logically equivalent to:
step1 <input | step2 | step3 | step4 >output
The splitting into subtasks is usually done using pipe(2) or
popen(3). But the splitting may also be completely avoided,
and rather simulated by using intermediate files. The various
`--sequence=strategy' options (see section How to use this program)
give you control over the flow methods, by replacing strategy
with `pipe', `popen' or `files'.
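Just to illustrate the `pipe' flavor of splitting, and not as an
excerpt from the recode sources, here is a sketch of two subtasks
interconnected through pipe(2); copy_step is a mere placeholder for an
elementary recoding:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Placeholder elementary recoding: merely copies its input.  */
static void
copy_step (FILE *input, FILE *output)
{
  int character;

  while ((character = getc (input)) != EOF)
    putc (character, output);
}

int
main (void)
{
  int fd[2];
  pid_t child;

  if (pipe (fd) < 0)
    exit (EXIT_FAILURE);

  child = fork ();
  if (child < 0)
    exit (EXIT_FAILURE);
  if (child == 0)
    {
      /* Child: first step, from standard input into the pipe.  */
      close (fd[0]);
      copy_step (stdin, fdopen (fd[1], "w"));
      exit (EXIT_SUCCESS);
    }

  /* Parent: second step, from the pipe to standard output.  */
  close (fd[1]);
  copy_step (fdopen (fd[0], "r"), stdout);
  wait (NULL);
  return EXIT_SUCCESS;
}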
A double step in recode is a special concept representing
a sequence of two single steps, the output of the first single step
being the special charset RFC 1345, the input of the second
single step being also RFC 1345. Special recode
machinery dynamically produces efficient, reversible, merge-able
single steps out of these double steps.
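The actual machinery is more involved, since the pivot is the large
RFC 1345 repertoire rather than a byte table, but the merging idea can
be hinted at by a deliberately oversimplified sketch in which both
single steps are plain 256-entry tables; merge_double_step is an
invented name:

/* The FIRST table recodes the original charset into the pivot, the
   SECOND table recodes the pivot into the final charset; composing
   them yields a single direct table.  */
static void
merge_double_step (const unsigned char first[256],
                   const unsigned char second[256],
                   unsigned char merged[256])
{
  int code;

  for (code = 0; code < 256; code++)
    merged[code] = second[first[code]];
}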
The main part of recode is written in C, as are most single
steps. A few single steps need to recognize sequences of multiple
characters; these are often better written in Flex. It is easy for a
programmer to add a new charset to recode. All it requires
is writing a few functions, usually kept in a single `.c' file,
adjusting `Makefile.in' and remaking recode.
One of the functions should convert from some previous charset to the
new one. Any previous charset will do, but try to select it so you
will not lose too much information while converting. The other
function should convert from the new charset to some older one. You
do not have to select the same old charset as for the previous
routine; once again, select any charset for which you will not lose
too much information while converting.
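For example, if the new charset happened to map one-to-one onto
Latin-1, one of the two functions could be sketched as below; the
names are invented, and the calling convention recode really expects
is the one documented in `recode.h', not necessarily this simplified
one:

#include <stdio.h>

/* Invented 256-entry mapping from Latin-1 codes to the new charset.  */
static const unsigned char latin1_to_mycharset[256] =
  {
    0                           /* fill in the real mapping here */
  };

/* Recode a whole file from Latin-1 to the new charset.  */
static void
file_latin1_mycharset (FILE *input_file, FILE *output_file)
{
  int character;

  while ((character = getc (input_file)) != EOF)
    putc (latin1_to_mycharset[character], output_file);
}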
If, for any of these two functions, you have to read multiple bytes of
the old charset before recognizing the character to produce, you might
prefer programming it in flex in a separate `.l' file.
Prototype your C or Flex files after one of those which already exist,
so as to keep the sources uniform. Besides, at make time,
all `.l' files are automatically merged into a single big one by
the script `mergelex.awk'.
There are a few hidden rules about how to write new recode
modules, which allow the creation of `initstep.h' at make
time, or the proper merging of all Flex files. Mimetism is a simple
approach which relieves me of explaining all these rules! Start with a
module closely resembling what you intend to do. Here is some advice
for picking an example. First decide if your new charset module is
to be driven by algorithms rather than by tables. For algorithmic
recodings, see `iconqnx.c' for C code, or `txtelat1.l'
for Flex code. For table driven recodings, see `ebcdic.c' for
one-to-one style recodings, `lat1html.c' for one-to-many style
recodings, or `atarist.c' for double-step style recodings. Just
select an example in the style that best fits your application.
Each of your source files should have its own initialization function,
named module_charset, which is meant to be executed
quickly once, prior to any recoding. It should declare the
names of your charsets and the single steps (or elementary recodings)
you provide, by calling declare_step one or more times.
Besides the charset names, declare_step expects a description
of the recoding quality (see `recode.h') and two functions you
also provide.
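A hedged sketch of such an initialization function follows, assuming a
new charset called MyCharset converting to and from Latin-1.
MY_STEP_QUALITY stands for whichever quality value from `recode.h' is
appropriate, and the exact parameter list of declare_step should be
taken from `recode.h' rather than from this sketch:

void
module_mycharset (void)
{
  /* Latin-1 to MyCharset.  */
  declare_step ("Latin-1", "MyCharset", MY_STEP_QUALITY,
                init_latin1_mycharset,   /* delayed initialization, or NULL */
                file_latin1_mycharset);  /* whole-file recoding function */

  /* MyCharset back to Latin-1.  */
  declare_step ("MyCharset", "Latin-1", MY_STEP_QUALITY,
                init_mycharset_latin1,
                file_mycharset_latin1);
}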
The first such function has the purpose of allocating structures,
preconditioning conversion tables, etc. It is also the usual way of
further modifying the STEP structure. This function is executed
only if and when the single step is retained in an actual recoding
sequence. If you do not need such delayed initialization, merely use
NULL for the function argument.
The second function executes the elementary recoding on a whole file.
There are a few cases when you can spare writing this function, by
using one of the predefined recoding functions instead:

   * file_one_to_one, while having a delayed initialization for
     presetting the STEP field one_to_one to the predefined value
     one_to_same;

   * file_one_to_one, while having a delayed initialization for
     presetting the STEP field one_to_one with your own table (this
     case is sketched right after this list);

   * file_one_to_many, while having a delayed initialization for
     presetting the STEP field one_to_many with your table.
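For the second of these cases, a delayed initialization could be
sketched as below, assuming it receives the STEP structure to fill in
(check `recode.h' for the true signature) and that mycharset_to_latin1
is your own 256-entry table; the corresponding declare_step call would
then name file_one_to_one as the recoding function:

static void
init_mycharset_latin1 (STEP *step)
{
  /* Preset the table; the predefined file_one_to_one function then
     performs the whole-file recoding from it.  */
  step->one_to_one = mycharset_to_latin1;
}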
If you have a recoding table handy in a suitable format but do not use
one of the predefined recoding functions, it is still a good idea to use
a delayed initialization to save it anyway, because the recode option
`-h' will take advantage of this information when available.
Finally, edit `Makefile.in' to add the source file name of your
routines to the C_STEPS or L_STEPS macro definition,
depending on whether your routines are written in C or in Flex.
For C files only, also modify the STEPOBJS macro definition.
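For example, with a new module living in a hypothetical `mycharset.c',
the edit would amount to something like the following, the ellipses
standing for whatever entries are already there:

C_STEPS = ... mycharset.c
STEPOBJS = ... mycharset.o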