Critical · XML · .xml

XML Encoding Mismatch — Declaration vs Actual Encoding

An XML file that declares UTF-8 encoding in its header but actually contains ISO-8859-1 (Latin-1) encoded characters, causing parser failures on non-ASCII characters.

Why It Fails

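XML parsers trust the encoding declaration and decode the byte stream accordingly. When the file is actually ISO-8859-1, every non-ASCII character is a single byte in the range 0x80-0xFF, which a UTF-8 decoder reads as the start or continuation of a multi-byte sequence. Those sequences are almost always invalid, so parsing fails at the first non-ASCII character.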
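The byte-level mismatch can be reproduced without any XML at all. The following is a minimal sketch using only Python's built-in codecs; the helper name `try_decode_utf8` is made up for illustration.

```python
# "Zürich" stored as ISO-8859-1: the ü (U+00FC) becomes the single byte 0xFC.
latin1_bytes = "Z\u00fcrich".encode("iso-8859-1")


def try_decode_utf8(data: bytes) -> str:
    """Decode as strict UTF-8, or describe the first offending byte."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid byte 0x{data[exc.start]:02x} at offset {exc.start}"


# 0xFC can never start a UTF-8 sequence, so decoding stops at offset 1.
result = try_decode_utf8(latin1_bytes)  # "invalid byte 0xfc at offset 1"
```

The same text encoded as UTF-8 stores ü as the two bytes 0xC3 0xBC, which is what a parser honoring the declaration expects to find.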

Broken Example

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <!-- File is actually ISO-8859-1 encoded -->
  <city>Zürich</city>
  <name>José García</name>
</data>

Expected Error Behavior

The parser throws an 'invalid byte sequence' or 'not well-formed' error at the first non-ASCII character. In tools that do not abort, characters like ü, é, í appear as garbage instead.
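As a rough illustration of this behavior, the sketch below feeds the broken example to Python's standard-library `xml.etree.ElementTree`, used here only as a convenient stand-in for the parsers listed on this page. The helper name `parses` is hypothetical, and the exact error text varies by parser.

```python
import xml.etree.ElementTree as ET

# The document from the broken example, written with escapes for the
# non-ASCII characters so the source itself stays encoding-neutral.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<data>\n"
    "  <city>Z\u00fcrich</city>\n"
    "  <name>Jos\u00e9 Garc\u00eda</name>\n"
    "</data>\n"
)

# Declared UTF-8, but the bytes on disk are ISO-8859-1.
mismatched = doc.encode("iso-8859-1")
# Declared UTF-8 and really UTF-8.
consistent = doc.encode("utf-8")


def parses(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


mismatched_ok = parses(mismatched)  # False: the parser rejects the input
consistent_ok = parses(consistent)  # True
```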

Affected Software

libxml2, Java SAX/DOM, Python lxml, PHP SimpleXML, C# XmlDocument

How to Fix

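Detect the actual encoding with chardet or the file command. Then either convert the file so its bytes match the declared encoding, or change the declaration to name the encoding the file really uses. Where a tool supports an automatic setting such as encoding='auto', that can be used instead.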
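The conversion step might look like the following sketch. It assumes the actual encoding has already been determined (for example with chardet or the file command) and is passed in as a string; `transcode_to_utf8` is a hypothetical helper name, and `xml.etree.ElementTree` is used only to check the result.

```python
import xml.etree.ElementTree as ET


def transcode_to_utf8(data: bytes, actual_encoding: str) -> bytes:
    """Re-encode raw XML bytes from their actual encoding to UTF-8.

    The XML declaration is left untouched, so this only suits files
    that already declare UTF-8, as in the broken example.
    """
    return data.decode(actual_encoding).encode("utf-8")


# Declares UTF-8, but é and í are stored as single ISO-8859-1 bytes.
broken = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<data><name>Jos\xe9 Garc\xeda</name></data>"
)

# Assumption: detection reported ISO-8859-1 for this file.
fixed = transcode_to_utf8(broken, "iso-8859-1")

# The converted bytes now match the declaration and parse cleanly.
name = ET.fromstring(fixed).find("name").text
```

Note that decoding as ISO-8859-1 never raises an error, because every byte value maps to a character. A wrong guess about the actual encoding therefore produces silently corrupted text rather than a failure, so it is worth inspecting the converted output.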