Saturday, 31 March 2012

XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification [4] produced by the W3C, and several other related specifications, [5] all free open standards. [6]

The design goals of XML emphasize simplicity, generality, and usability over the Internet. [7] It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

Many application programming interfaces (APIs) have been developed for software developers to use to process XML data, and several schema systems exist to aid in the definition of XML-based languages.

As of 2009, hundreds of XML-based languages have been developed, [8] including RSS, Atom, SOAP, and XHTML. XML-based formats have become the default for many office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork. [9] XML has also been employed as the base language for communication protocols, such as XMPP.

Key terminology

The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs which appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.

(Unicode) Character

By definition, an XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.

Processor and Application

The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.
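The division of labor between processor and application can be sketched with Python's standard `xml.parsers.expat` module. This is only an illustration: the `collect_events` helper and the sample document are made up for this example, and the callbacks stand in for whatever a real application would do with the information it receives.

```python
import xml.parsers.expat


def collect_events(document: str) -> list:
    """Run the processor over `document` and record what it reports."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one piece

    # The "application" side: callbacks that receive structured information.
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda text: events.append(("text", text))

    parser.Parse(document, True)  # True marks the final chunk of input
    return events


# Hypothetical sample document, used only for illustration.
events = collect_events("<note><to>Ann</to></note>")
for event in events:
    print(event)
```

The processor never decides what the events mean; it only reports them in document order, which is the boundary the specification draws.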

Markup and Content

The characters which make up an XML document are divided into markup and content. Markup and content may be distinguished by the application of simple syntactic rules. All strings which constitute markup either begin with the character < and end with a >, or begin with the character & and end with a ;. Strings of characters which are not markup are content.

Tag

A markup construct that begins with < and ends with >. Tags come in three flavors:

start-tags; for example: <section>

end-tags; for example: </section>

empty-element tags; for example: <line-break />

Element

A logical document component either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting>. Another is <line-break />.

Attribute

A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In the example <img src="madonna.jpg" alt='Foligno Madonna, by Raphael' /> the element img has two attributes, src and alt. Another example would be <step number="3">Connect A to B.</step> where the name of the attribute is "number" and the value is "3".
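As a rough illustration of how elements, child elements, and attributes look to a program, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module. The element names `steps`, `step`, and `line-break` are assumptions chosen for the example, not anything defined by the XML specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one root element with two child elements.
doc = """<steps>
  <step number="3">Connect A to B.</step>
  <line-break />
</steps>"""

root = ET.fromstring(doc)

# Child elements of the root, in document order.
child_names = [child.tag for child in root]

# An element with an attribute (name "number", value "3") and text content.
step = root.find("step")
step_number = step.get("number")
step_text = step.text

# An empty-element tag yields an element with no text and no children.
line_break = root.find("line-break")
is_empty = line_break.text is None and len(line_break) == 0

print(child_names, step_number, step_text, is_empty)
```

Attribute values always arrive as strings; interpreting "3" as a number is left to the application.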

Characters and escaping

XML documents consist entirely of characters from the Unicode repertoire. Except for a small number of specifically excluded control characters, any character defined by Unicode may appear within the content of an XML document. The selection of characters that may appear within markup is somewhat more limited but still large.

XML includes facilities for identifying the encoding of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly.
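A brief sketch of these facilities, using only Python's standard library: `xml.sax.saxutils.escape` replaces the characters that would otherwise be read as markup, numeric character references name a character by its code point, and an encoding declaration tells the processor how the bytes map to characters. The sample strings and element names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# 1. Escaping: "<", ">" and "&" cannot appear literally in content.
raw = "if a < b && b > c"
escaped = escape(raw)
round_trip = ET.fromstring("<code>" + escaped + "</code>").text

# 2. Numeric character references: hexadecimal and decimal forms
#    of the same code point (U+4E2D) yield the same character.
referenced = ET.fromstring("<p>&#x4E2D;&#20013;</p>").text

# 3. Encoding declaration: the byte 0xE9 is "é" in ISO-8859-1.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xe9</p>'
decoded = ET.fromstring(latin1_doc).text

print(escaped)
print(round_trip == raw, referenced, decoded)
```

Note that the third document is passed as bytes; a declaration only matters when the processor is the one doing the decoding.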

Valid characters

Unicode code points in the following ranges are valid in XML 1.0 documents: [10]

U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;

U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);

U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

XML 1.1 [11] extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected.

The code point U+0000 is the only character that is not permitted in any XML 1.0 or 1.1 document.
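The XML 1.0 ranges listed above translate directly into a small membership test. The following is a minimal sketch; the function names `is_xml10_char` and `first_invalid_char` are made up for this example, and the check covers XML 1.0 only, not the extended rules of XML 1.1.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point may appear in an XML 1.0 document."""
    return (
        code_point in (0x9, 0xA, 0xD)          # the only permitted C0 controls
        or 0x20 <= code_point <= 0xD7FF        # BMP below the surrogates
        or 0xE000 <= code_point <= 0xFFFD      # BMP above the surrogates, minus U+FFFE/U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF   # supplementary planes
    )


def first_invalid_char(text: str) -> int:
    """Return the index of the first character XML 1.0 forbids, or -1 if none."""
    for index, char in enumerate(text):
        if not is_xml10_char(ord(char)):
            return index
    return -1


print(is_xml10_char(0x0), is_xml10_char(0x9), is_xml10_char(0xD800))
print(first_invalid_char("ok\x0bno"))
```

A check like this is useful before serializing text from an untrusted source, since a forbidden character cannot be rescued by escaping in XML 1.0.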

Well-formedness and error-handling

The XML specification defines an XML document as a text that is well-formed, i.e. it satisfies a list of syntax rules provided in the specification. The list is fairly lengthy; some key points are:

It contains only properly encoded legal Unicode characters.

None of the special syntax characters such as "<" and "&" appear except when performing their markup-delineation roles.

The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping.

The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[\]^`{|}~, nor a space character, and cannot start with -, ., or a numeric digit.

There is a single "root" element that contains all the other elements.

The definition of an XML document excludes texts that contain violations of well-formedness rules; they are simply not XML. An XML processor that encounters such a violation is required to report such errors and to cease normal processing. This policy, occasionally referred to as draconian, stands in notable contrast to the behavior of programs that process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors. [16] XML's policy in this area has been criticized as a violation of Postel's law ("Be conservative in what you send; be liberal in what you accept"). [17]
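The draconian policy is easy to observe with a conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on the first well-formedness violation; the helper name `is_well_formed` and the sample inputs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text`, False if it reports an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<a><b/></a>",               # correctly nested
    "<a>fish &amp; chips</a>",   # special character properly escaped
    "<a><b></a></b>",            # overlapping tags
    "<a>1 < 2</a>",              # bare "<" in content
    "<A></a>",                   # case mismatch between start and end tag
    "<a/><b/>",                  # more than one root element
]
for sample in samples:
    print(sample, is_well_formed(sample))
```

Each failing sample corresponds to one of the key points in the list above; the parser does not attempt to repair any of them.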

Schemas and validation

In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.

XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor that discovers a validity error must be able to report it, but may continue normal processing.

A DTD is an example of a schema or grammar. Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and the allowable parent/child relationships.

Document Type Definition

The oldest schema language for XML is the Document Type Definition (DTD), inherited from SGML.

DTDs have the following benefits:

DTD support is ubiquitous due to its inclusion in the XML 1.0 standard.

DTDs are terse compared to element-based schema languages and consequently present more information in a single screen.

DTDs allow the declaration of standard public entity sets for publishing characters.

DTDs define a document type rather than the types used by a namespace, thus grouping all constraints for a document in a single collection.

DTDs have the following limitations:

They have no explicit support for newer features of XML, most importantly namespaces.

They lack expressiveness. XML DTDs are simpler than SGML DTDs and there are certain structures that cannot be expressed with regular grammars. DTDs only support rudimentary datatypes.

They lack readability. DTD designers typically make heavy use of parameter entities (which behave essentially as textual macros), which make it easier to define complex grammars, but at the expense of clarity.

They use a syntax based on regular expression syntax, inherited from SGML, to describe the schema. Typical XML APIs such as SAX do not attempt to offer applications a structured representation of the syntax, so it is less accessible to programmers than an element-based syntax may be.

Two peculiar features that distinguish DTDs from other schema types are the syntactic support for embedding a DTD within XML documents and for defining entities, which are arbitrary fragments of text and/or markup that the XML processor inserts in the DTD itself and in the XML document wherever they are referenced, like character escapes.
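Both features can be seen in a small document that embeds its DTD in an internal subset and declares an entity there. The sketch below is illustrative only: the `memo` element and the `company` entity are invented names, and Python's standard `xml.etree.ElementTree` is a non-validating processor, so it expands the entity but does not check the document against the element declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an embedded (internal subset) DTD.
doc = """<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!ENTITY company "Example Corp">
]>
<memo>Prepared by &company;.</memo>"""

memo = ET.fromstring(doc)

# The processor has already replaced &company; with its declared text.
memo_text = memo.text
print(memo_text)
```

Because entity expansion happens inside the processor, applications that parse untrusted input usually restrict or disable DTD processing.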

DTD technology is still used in many applications because of its ubiquity.

XML Schema

A newer schema language, described by the W3C as the successor of DTDs, is XML Schema, often referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system and allow for more detailed constraints on an XML document's logical structure. XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them.
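Because an XSD is itself an XML document, an ordinary parser can read it. The following sketch uses Python's standard `xml.etree.ElementTree` to list the declarations in a small schema; the schema content (a `step` element with a required `number` attribute) is a made-up example, and the code only inspects the schema rather than validating anything against it.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema vocabulary.
XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema, written for illustration only.
schema_text = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="step">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="number" type="xs:positiveInteger" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(schema_text)

# Top-level element declarations.
declared_elements = [e.get("name") for e in schema.findall(f"{{{XS}}}element")]

# Every attribute declaration anywhere in the schema, with its datatype.
declared_attributes = {
    a.get("name"): a.get("type") for a in schema.iter(f"{{{XS}}}attribute")
}

print(declared_elements, declared_attributes)
```

The `xs:positiveInteger` type is an example of the richer datatyping that DTDs cannot express.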

RELAX NG

RELAX NG was initially specified by OASIS and is now also an ISO/IEC International Standard (as part of DSDL). RELAX NG schemas may be written in either an XML-based syntax or a more compact non-XML syntax; the two syntaxes are isomorphic and James Clark's Trang conversion tool can convert between them without loss of information. RELAX NG has a simpler definition and validation framework than XML Schema, making it easier to use and implement. It also has the ability to use datatype framework plug-ins; a RELAX NG schema author, for example, can require values in an XML document to conform to definitions in XML Schema Datatypes.

ISO DSDL and other schema languages

The ISO DSDL (Document Schema Description Languages) standard brings together a comprehensive set of small schema languages, each targeted at specific problems. DSDL includes RELAX NG full and compact syntax, Schematron assertion language, and languages for defining datatypes, character repertoire constraints, renaming and entity expansion, and namespace-based routing of document fragments to different validators. DSDL schema languages do not have the vendor support of XML Schemas yet, and are to some extent a grassroots reaction of industrial publishers to the lack of utility of XML Schemas for publishing.

Some schema languages not only describe the structure of a particular XML format but also offer limited facilities to influence processing of individual XML files that conform to this format. DTDs and XSDs both have this ability; they can for instance provide the infoset augmentation facility and attribute defaults. RELAX NG and Schematron intentionally do not provide these.
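Attribute defaults are a simple case of a schema influencing processing. In the sketch below, a DTD in the internal subset declares a default for an attribute that the document never writes, and the processor supplies it anyway. The `memo` element and `status` attribute are hypothetical names, and the behavior shown is that of Python's standard `xml.etree.ElementTree` (built on expat); other processors may need DTD handling to be switched on explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the start-tag carries no attributes,
# but the DTD declares a default value for "status".
doc = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
]>
<memo>Budget review</memo>"""

memo = ET.fromstring(doc)

# The default from the DTD appears as if it had been written in the document.
status = memo.get("status")
print(status)
```

An application reading this document cannot tell from the parsed result whether the attribute was written explicitly or supplied from the DTD.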