Using the Python Requests module to POST documents to Solr
Method Definition
The first method accepts the Solr host URL and the JSON payload, and passes commit=false so that Solr does not commit after every POST. The second method does not POST any data; it only commits any pending transactions.
import requests

def post(host, data):
    # Send the JSON payload without committing.
    headers = {"content-type": "application/json"}
    params = {"commit": "false"}
    return requests.post(host, data=data, params=params, headers=headers)

def commit(host):
    # Send no data; only ask Solr to commit pending documents.
    headers = {"content-type": "application/json"}
    params = {"commit": "true"}
    return requests.post(host, params=params, headers=headers)
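A slightly more defensive variant of the two helpers is sketched below. It assumes the caller passes in an object with a Requests-style post method (for example a requests.Session). The name send_update, the timeout value, and the call to raise_for_status are illustrative choices and not part of the original code.

```python
def send_update(session, host, data=None, commit=False, timeout=10):
    """POST an update request to Solr, optionally committing.

    `session` is assumed to be a requests.Session (or anything with a
    compatible .post method); `host` is the full URL of the update handler.
    """
    headers = {"Content-Type": "application/json"}
    # The commit flag travels in the query string, as in post() and commit().
    params = {"commit": "true" if commit else "false"}
    response = session.post(host, data=data, params=params,
                            headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently continuing.
    response.raise_for_status()
    return response
```

With Requests installed, send_update(requests.Session(), host, data=body) would stand in for post, and send_update(session, host, commit=True) for commit.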
Method Invocation
The most important part of the method invocation is construction of the JSON payload:
payload = {
    "add": {
        "doc": str(data)
    }
}
The payload has three parts:
- The add command tells Solr that a create or update is going to be performed.
- The doc key introduces the document to be indexed.
- str(data) contains the data read in from a file.
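The structure described above can be sketched as a small helper. This is a minimal sketch that assumes each input file holds a single JSON object. The name build_add_command is hypothetical, and parsing the file text with json.loads (so that doc is a nested object rather than a string) is a substitution for the str(data) approach used in this post.

```python
import json

def build_add_command(raw_text):
    """Wrap one document in Solr's JSON "add" command.

    Assumes `raw_text` is the text of a file containing one JSON object.
    """
    doc = json.loads(raw_text)      # parse the file contents into a dict
    return {"add": {"doc": doc}}    # "add" = create or update, "doc" = the document

command = build_add_command('{"id": "42", "title": "Example"}')
print(json.dumps(command))
```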
The full code also takes into account the commit threshold:
import file_utils  # helper module from the Python Snippets post (see References)

def parse(host, dir_in, ext, threshold):
    counter = 0
    total_commits = 0
    files = file_utils.getfiles(dir_in, ext)
    # Round up so that a final partial batch counts as one commit.
    total_required_commits = (len(files) + threshold - 1) // threshold
    for path in files:
        # Read the incoming file.
        with open(path, "r") as myfile:
            # Assumption: strip newlines so the document is a single line.
            data = myfile.read().replace("\n", "")
        payload = {
            "add": {
                "doc": str(data)
            }
        }
        response = post(host, cleanse(payload))
        counter = counter + 1
        print("Post Response (status = {0}, counter = {1}-{2}, total-commits = {3}-{4})".format(response.status_code, counter, threshold, total_commits, total_required_commits))
        if counter >= threshold:
            print("About to Commit (total-docs = {0})".format(threshold))
            commit(host)
            counter = 0
            total_commits = total_commits + 1
    # Commit whatever is left after the last full batch.
    if counter > 0:
        commit(host)
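def cleanse(payload):
    # Turn the dict's Python repr into JSON text. Assumed intent: drop the
    # single quotes around the embedded document and double-quote the keys.
    payload = str(payload)
    payload = payload.replace("'{", "{")
    payload = payload.replace("}'", "}")
    payload = payload.replace("'add'", "\"add\"")
    payload = payload.replace("'doc'", "\"doc\"")
    return payload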
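The commit-threshold logic can also be looked at separately from the file handling. The sketch below is only an illustration: chunk and required_commits are hypothetical helpers, the file names are made up, and rounding up with math.ceil is one way to count a final partial batch as one more commit.

```python
import math

def chunk(items, threshold):
    """Yield consecutive batches of at most `threshold` items."""
    for start in range(0, len(items), threshold):
        yield items[start:start + threshold]

def required_commits(total_docs, threshold):
    """Commits needed if every batch, including a partial one, is committed."""
    return math.ceil(total_docs / threshold)

files = ["a.json", "b.json", "c.json", "d.json", "e.json"]
for batch in chunk(files, 2):
    # In the real loop each file would be POSTed here, followed by one commit.
    print(len(batch), "documents, then commit")
print("commits:", required_commits(len(files), 2))
```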
Payload Definition
The payload must correspond to the schema.xml file defined within the Solr core (solr_data/{core}/conf/schema.xml):
<schema name="documents" version="1.5">
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="_root_" type="string" indexed="true" stored="false"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="filename" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="title" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="referenced_title" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
    <field name="abstract" type="text_transcript" indexed="true" stored="true" omitNorms="true"/>
    <field name="text" type="text_transcript" indexed="true" stored="true" omitNorms="true" multiValued="true"/>
  </fields>
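Because the payload has to match the schema, it can help to check a document before posting it. The following is a rough sketch under stated assumptions: it embeds a shortened copy of the field list above (closed so that it parses), and check_doc is a hypothetical helper that only looks at the name, required, and multiValued attributes. Solr itself remains the authority on what it accepts.

```python
import xml.etree.ElementTree as ET

# Shortened, self-closing copy of the schema excerpt (an assumption for the demo).
SCHEMA_XML = """
<schema name="documents" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_transcript" indexed="true" stored="true"/>
    <field name="referenced_title" type="text_transcript" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
"""

def check_doc(doc, schema_xml=SCHEMA_XML):
    """Return a list of problems found when comparing `doc` to the schema fields."""
    fields = {f.get("name"): f for f in ET.fromstring(schema_xml).iter("field")}
    problems = []
    # Every field marked required="true" must be present.
    for name, field in fields.items():
        if field.get("required") == "true" and name not in doc:
            problems.append("missing required field: " + name)
    # Every supplied field must exist, and lists need multiValued="true".
    for name, value in doc.items():
        if name not in fields:
            problems.append("unknown field: " + name)
        elif isinstance(value, list) and fields[name].get("multiValued") != "true":
            problems.append("field is not multiValued: " + name)
    return problems

print(check_doc({"title": ["a", "b"], "author": "x"}))
```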
A sample payload looks like this:
payload = {
    "add" : {
        "doc" : {
            "id" : -3141779815403614,
            "filename" : "S007911130030X.xml",
            "title" : "Methane production induced by dimethylsulfide in surface water of an upwelling ecosystem",
            "abstract" : "Atmospheric oxidation of the surface of chalcopyrite has been investigated using electrochemical techniques.",
            "referenced_title" : [
                "The contribution of nano- and micro-planktonic assemblages in the surface layer (0–30 m) under different hydrographic conditions in the upwelling area off Concepción, central Chile",
                "Ocean-atmosphere interaction in the global biogeochemical sulfur cycle",
                "Atmospheric methane and global change"
            ]
        }
    }
}
Note that when posting this to Solr it's a good idea to use the Python str(...) function:
post(host, str(payload))
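As an alternative to str(...), the standard library's json.dumps produces strictly valid JSON. The short comparison below uses a made-up document and is an illustration rather than the author's code: str() on a dict yields Python's repr (single quotes, True instead of true), which a strict JSON parser such as Python's own rejects.

```python
import json

# Made-up document for the comparison.
payload = {"add": {"doc": {"id": "42", "title": "Example", "reviewed": True}}}

as_repr = str(payload)         # Python repr: single quotes, True
as_json = json.dumps(payload)  # JSON text: double quotes, true

def is_valid_json(text):
    """Return True if `text` parses under Python's strict JSON parser."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

print(is_valid_json(as_repr), is_valid_json(as_json))
```

Passing as_json as the data argument of post would then avoid the need for any quote rewriting.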
References
- [Blogger] Python Snippets (includes file_utils.py referenced above)