If you have a binary file, such as a PDF or an Office document, you can send it with the dataload API and let the Search Appliance extract the text from it.
<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
      xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
    <Item>
        <Type>I</Type>
        <Url>http://www.example.com/dataload.pdf</Url>
        <RawData dt:dt="bin.base64">0M8R4KGxGu....</RawData>
    </Item>
</ThunderstoneReplication>
The elements are:
<Type>
    The action to take with this data.  The text value may be one of:
        I    Insert the data (overwrite previous data for the URL, if any)
<Url>
    The URL of the document.
<RawData>
    The base64 encoding of the raw document.  This element must include
    the dt:dt="bin.base64" attribute.
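
As an illustration of how such a payload might be assembled, the following Python sketch base64-encodes a document's raw bytes and wraps them in the XML structure shown above. The function name build_dataload_xml is hypothetical and is not part of the dataload API. The sketch only builds the XML; how the result is submitted to the Search Appliance is not covered here.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:dt declaration in the example above.
DT_NS = "urn:schemas-microsoft-com:datatypes"


def build_dataload_xml(url: str, raw: bytes) -> bytes:
    """Build a dataload XML payload for one binary document.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch, not something defined by the dataload API.
    """
    # Serialize the namespace with the "dt" prefix, as in the example.
    ET.register_namespace("dt", DT_NS)

    root = ET.Element("ThunderstoneReplication")
    item = ET.SubElement(root, "Item")

    # "I" inserts the data, overwriting previous data for the URL if any.
    ET.SubElement(item, "Type").text = "I"
    ET.SubElement(item, "Url").text = url

    # <RawData> must carry the dt:dt="bin.base64" attribute.
    raw_el = ET.SubElement(item, "RawData", {f"{{{DT_NS}}}dt": "bin.base64"})
    raw_el.text = base64.b64encode(raw).decode("ascii")

    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

A caller would typically read the file in binary mode and pass its contents, for example build_dataload_xml("http://www.example.com/dataload.pdf", open("dataload.pdf", "rb").read()).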