Using the Python Requests module to POST documents to Solr




Method Definition


The first method accepts the Solr host URL and the JSON payload. It posts with commit set to false, so nothing is committed automatically. The second method does not POST any data; it only commits any pending changes.


import requests

def post(host, data):
    # Send the JSON payload without committing (commit=false).
    headers = {"content-type": "application/json"}
    params = {"commit": "false"}
    return requests.post(host, data=data, params=params, headers=headers)

def commit(host):
    # Send no data; the request only commits pending documents.
    headers = {"content-type": "application/json"}
    params = {"commit": "true"}
    return requests.post(host, params=params, headers=headers)
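
The only difference between the two requests is the commit parameter, which travels in the query string. The sketch below illustrates that using only the standard library. build_update_url is a hypothetical helper and the host URL is an assumed example; neither comes from the original code. requests encodes its params dictionary onto the URL in a similar way.

```python
from urllib.parse import urlencode

def build_update_url(host, commit):
    """Return the URL that a POST with params={"commit": ...} would target."""
    params = {"commit": "true" if commit else "false"}
    return host + "?" + urlencode(params)

# Assumed example host; substitute the update handler of your own core.
host = "http://localhost:8983/solr/documents/update"

print(build_update_url(host, commit=False))
print(build_update_url(host, commit=True))
```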



Method Invocation


The most important part of the method invocation is construction of the JSON payload:
payload = {
    "add" : {
        "doc" : str(data)
    }
}


The payload has three parts:
  1. The add command tells Solr that a create or update is about to be performed.
  2. The doc key marks the start of the document itself.
  3. str(data) holds the document contents read in from a file.
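
As an alternative way to assemble the same structure, here is a minimal sketch that uses the standard json module. It assumes each input file already contains one JSON document; build_payload and the sample text are hypothetical names introduced for illustration.

```python
import json

def build_payload(doc_text):
    """Wrap one JSON document (given as text) in Solr's add/doc envelope."""
    doc = json.loads(doc_text)                  # parse the file contents
    return json.dumps({"add": {"doc": doc}})    # serialize the whole request body

# Hypothetical file contents.
sample = '{"id": "1", "title": "Example"}'
body = build_payload(sample)
print(body)
```

Parsing first means a malformed file fails on the client with a ValueError before anything is sent.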

The full code also takes into account the commit threshold:
import file_utils  # helper module from the Python Snippets reference

def parse(host, dir_in, ext, threshold):

    counter = 0
    total_commits = 0

    files = file_utils.getfiles(dir_in, ext)
    total_required_commits = len(files) // threshold

    for file in files:

        # READ INCOMING FILE ...
        # The replace() arguments were lost in the original listing;
        # stripping newlines is an assumption.
        with open(file, "r") as myfile:
            data = myfile.read().replace("\n", "")

        payload = {
            "add" : {
                "doc" : str(data)
            }
        }

        response = post(host, cleanse(payload))

        counter = counter + 1
        print("Post Response (status = {0}, counter = {1}-{2}, total-commits = {3}-{4})".format(response.status_code, counter, threshold, total_commits, total_required_commits))

        # Commit once the number of uncommitted documents reaches the threshold.
        if counter >= threshold:
            print("About to Commit (total-docs = {0})".format(threshold))
            commit(host)
            counter = 0
            total_commits = total_commits + 1

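def cleanse(payload):
    # Turn the str() form of the dict into JSON text. The quote characters
    # in these replace() calls were lost in the original listing; the
    # patterns below are a reconstruction.
    payload = str(payload)
    payload = payload.replace("'{", "{")
    payload = payload.replace("}'", "}")
    payload = payload.replace("'add'", '"add"')
    payload = payload.replace("'doc'", '"doc"')
    return payload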
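
The commit-threshold logic in parse can be isolated from the network calls, which makes it easier to reason about. The following is a minimal sketch, not the original code: index_in_batches is a hypothetical function, and post_fn and commit_fn stand in for the post and commit methods. Unlike parse, it also commits a final partial batch so that trailing documents are not left pending.

```python
def index_in_batches(docs, threshold, post_fn, commit_fn):
    """Post every doc, committing after each `threshold` posts.

    Returns the number of commits issued.
    """
    pending = 0
    commits = 0
    for doc in docs:
        post_fn(doc)
        pending += 1
        if pending >= threshold:
            commit_fn()
            commits += 1
            pending = 0
    # Commit whatever is left over from the last, partial batch.
    if pending:
        commit_fn()
        commits += 1
    return commits

# Stand-ins that record calls instead of talking to Solr.
posted = []
commit_log = []
total = index_in_batches(
    ["a", "b", "c", "d", "e"],
    threshold=2,
    post_fn=posted.append,
    commit_fn=lambda: commit_log.append(len(posted)),
)
print(total, commit_log)
```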



Payload Definition


The payload must correspond to the schema.xml file defined within the Solr core (solr_data/{core}/conf/schema.xml):
<schema name="documents" version="1.5">

  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="_root_" type="string" indexed="true" stored="false"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="filename" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="title" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="referenced_title" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
    <field name="abstract" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="text" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
  </fields>


A sample payload looks like this:
payload = {
    "add" : {
        "doc" : {
            "id" : -3141779815403614,
            "filename" : "S007911130030X.xml",
            "title" : "Methane production induced by dimethylsulfide in surface water of an upwelling ecosystem",
            "abstract" : "Atmospheric oxidation of the surface of chalcopyrite has been investigated using electrochemical techniques.",
            "referenced_title" : [
                "The contribution of nano- and micro-planktonic assemblages in the surface layer (0–30 m) under different hydrographic conditions in the upwelling area off Concepción, central Chile",
                "Ocean-atmosphere interaction in the global biogeochemical sulfur cycle",
                "Atmospheric methane and global change"
            ]
        }
    }
}
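
One way to catch a mismatch between a document and the schema before posting is to compare field names on the client. The sketch below is illustrative only: check_doc is a hypothetical helper, SCHEMA_XML is a shortened stand-in for the schema above, and the check ignores Solr features such as dynamic fields and copy fields.

```python
import xml.etree.ElementTree as ET

# Shortened, self-contained stand-in for schema.xml.
SCHEMA_XML = """
<schema name="documents" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_transcript" indexed="true" stored="true"/>
    <field name="referenced_title" type="text_transcript" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
"""

def check_doc(doc, schema_xml):
    """Return a list of problems found when comparing doc to the schema."""
    root = ET.fromstring(schema_xml)
    fields = {f.get("name"): f for f in root.iter("field")}
    problems = []
    for name, value in doc.items():
        if name not in fields:
            problems.append("unknown field: " + name)
        elif isinstance(value, list) and fields[name].get("multiValued") != "true":
            problems.append("field is not multiValued: " + name)
    for name, field in fields.items():
        if field.get("required") == "true" and name not in doc:
            problems.append("missing required field: " + name)
    return problems

good = {"id": "1", "title": "Example", "referenced_title": ["A", "B"]}
bad = {"title": ["A"], "author": "X"}
print(check_doc(good, SCHEMA_XML))
print(check_doc(bad, SCHEMA_XML))
```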


Note that when posting this to Solr it's a good idea to convert the payload with the Python str(...) function. The payload goes to post, since commit only accepts the host:
post(host, str(payload))
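
One caveat is that str(...) on a dictionary produces Python's own representation, which uses single quotes and is not strict JSON. The sketch below compares it with json.dumps from the standard library; is_strict_json is a hypothetical helper and the small payload is made up for illustration. Whether a given server tolerates the str(...) form depends on how lenient its parser is.

```python
import json

def is_strict_json(text):
    """Return True if text parses with Python's strict JSON parser."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Made-up payload for illustration.
payload = {"add": {"doc": {"id": "1", "title": "It's an example"}}}

as_str = str(payload)          # Python repr: single-quoted keys
as_json = json.dumps(payload)  # strict JSON: double-quoted keys

print(is_strict_json(as_str))
print(is_strict_json(as_json))
```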



References

  1. [Blogger] Python Snippets (includes file_utils.py referenced above)
