#### Python Notebook to read all the text files from a text dataset into a Neo4j database (syntactically-based, SVO)
---
1. List all the text files in the sub-directories of your dataset.
2. Read all the files.
3. Create nodes, where n(nodes) = n(files)
4. Dump each text file into its own node, where every node is a document, using GraphAware's NLP pipeline.
5. Sample scripts for entity extraction
---
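%% Cell type:markdown id: tags:
The first three steps above can be sketched as follows. This is a minimal, self-contained illustration, not the notebook's actual pipeline: it builds a throwaway dataset with `tempfile`, and the `Document` label and property names in the Cypher statement are assumptions.
%% Cell type:code id: tags:
``` python
# Hypothetical sketch: list files, read them, and prepare one Cypher
# CREATE statement per file (n(nodes) = n(files)).
import glob
import os
import tempfile

# Build a tiny throwaway dataset so the sketch is self-contained.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'bbc', 'tech'), exist_ok=True)
with open(os.path.join(root, 'bbc', 'tech', '001.txt'), 'w') as f:
    f.write('Sample article text.')

# Step 1: list all the text files in the sub-directories of the dataset.
files = glob.glob(os.path.join(root, 'bbc', '*', '*.txt'))

# Steps 2-3: read each file and prepare one node per file.
statements = []
for path in files:
    with open(path) as f:
        text = f.read()
    # Parameterised Cypher avoids quoting issues in the document text.
    statements.append(('CREATE (d:Document {name: $name, text: $text})',
                       {'name': os.path.basename(path), 'text': text}))

print('%d files -> %d node statements' % (len(files), len(statements)))
```
Each `(query, params)` pair would then be executed against the database, e.g. with the official `neo4j` Python driver's `session.run(query, **params)`.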
%% Cell type:markdown id: tags:
Imports<br>glob --> for iterating through the folders and sub-folders<br>csv --> for writing results to CSV files
%% Cell type:code id: tags:
``` python
import glob
import csv
```
%% Cell type:markdown id: tags:
Specifying the path for the files; the wildcards at the end of the path denote that all the files from all the subdirectories of the bbc folder will be accessed.
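%% Cell type:markdown id: tags:
A minimal sketch of that wildcard pattern: with `recursive=True`, the `**` component matches files at any depth below the bbc folder. The directory layout here is made up for the demo.
%% Cell type:code id: tags:
``` python
# Demonstrate recursive globbing over a throwaway bbc-style tree.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
for sub in ('business', 'sport'):
    os.makedirs(os.path.join(root, 'bbc', sub))
    with open(os.path.join(root, 'bbc', sub, 'a.txt'), 'w') as f:
        f.write('text')

# '**' plus recursive=True descends into every sub-directory.
paths = sorted(glob.glob(os.path.join(root, 'bbc', '**', '*.txt'),
                         recursive=True))
print(len(paths))  # one match per sub-directory
```
%% Cell type:markdown id: tags: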
#### OpenIE for automated triple extraction based on open information extraction triples
---
1. Coref resolution reads a file and substitutes the antecedents in place, producing a resolved text.
2. OpenIE reads the resolved text and extracts triples using Stanford CoreNLP.
3. The output is a list of dicts containing "subject-relation-object" triples.
4. Due to the OIE implementation, similar nodes and edges (with almost identical meanings) are present; we can remove them (see below), but we may also miss information this way.
5. Transfer these triples to Neo4j as nodes and edges.
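%% Cell type:markdown id: tags:
A hedged sketch of the transfer step: given OpenIE-style triples as dicts with `subject`, `relation`, and `object` keys, build parameterised Cypher `MERGE` statements. The `triples_to_cypher` helper, the `Entity` label, and the relation-to-type sanitisation are assumptions, not part of the notebook's actual code.
%% Cell type:code id: tags:
``` python
# Hypothetical helper: turn OpenIE-style triples into Cypher MERGE
# statements ready to run against Neo4j.
import re

def triples_to_cypher(triples):
    statements = []
    for t in triples:
        # Relationship types cannot be parameterised in Cypher, so
        # sanitise the relation text into an identifier-safe form.
        rel = re.sub(r'\W+', '_', t['relation'].strip()).upper()
        query = ('MERGE (s:Entity {name: $subj}) '
                 'MERGE (o:Entity {name: $obj}) '
                 'MERGE (s)-[:%s]->(o)' % rel)
        statements.append((query, {'subj': t['subject'],
                                   'obj': t['object']}))
    return statements

triples = [{'subject': 'OpenIE', 'relation': 'extracts',
            'object': 'triples'}]
for query, params in triples_to_cypher(triples):
    print(query, params)
```
Using `MERGE` rather than `CREATE` collapses exact duplicates; the near-duplicates mentioned in step 4 (almost identical meanings, different surface forms) still need separate handling.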
%% Cell type:code id: tags:
``` python
print('Found %s triples in the corpus.' % len(triples_corpus))
for triple in triples_corpus:
    print('|-', triple)
```
%%%% Output: stream
/home/earendil/anaconda3/envs/athnlp/lib/python3.6/site-packages/torch/nn/modules/rnn.py:51: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
"num_layers={}".format(dropout, num_layers))
/home/earendil/anaconda3/envs/athnlp/lib/python3.6/site-packages/allennlp/data/token_indexers/token_characters_indexer.py:56: UserWarning: You are using the default value (0) of `min_padding_length`, which can cause some subtle bugs (more info see https://github.com/allenai/allennlp/issues/1954). Strongly recommend to set a value, usually the maximum size of the convolutional layer size when using CnnEncoder.
UserWarning)
/tmp/pip-req-build-ocx5vxk7/aten/src/ATen/native/LegacyDefinitions.cpp:19: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.
WARNING:allennlp.models.model:Encountered the antecedent_indices key in the model's return dictionary which couldn't be split by the batch size. Key will be ignored.
%%%% Output: stream
Corpus: TRESSPASS: robusT Risk basEd Screening and alert System for PASSengers and lugga [...].